Parallel operation device allowing efficient parallel operational processing

ABSTRACT

In arithmetic/logic units (ALU) provided corresponding to entries, an MIMD instruction decoder generating a group of control signals in accordance with a Multiple Instruction-Multiple Data (MIMD) instruction and an MIMD register storing data designating the MIMD instruction are provided, and an inter-ALU communication circuit is provided. The amount and direction of movement of the inter-ALU communication circuit are set by data bits stored in a movement data register. It is possible to execute data movement and arithmetic/logic operation with the amount of movement and operation instruction set individually for each ALU unit. Therefore, in a Single Instruction-Multiple Data type processing device, Multiple Instruction-Multiple Data operation can be executed at high speed in a flexible manner.

This application is a continuation of U.S. application Ser. No. 11/840,116, filed Aug. 16, 2007, the content of which is herein incorporated in its entirety by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a semiconductor processing device and, more specifically, to a configuration of a processing circuit performing arithmetic/logic operations on a large amount of data at high speed using semiconductor memories.

2. Description of the Background Art

Recently, along with widespread use of portable terminal equipment, digital signal processing allowing high speed processing of a large amount of data such as voice data and image data has come to have higher importance. For such digital signal processing, generally, a DSP (Digital Signal Processor) is used as a dedicated semiconductor device. Digital signal processing of voice and image includes data processing such as filtering, which in turn frequently requires arithmetic operations with repetitive sum-of-products operations. Therefore, a DSP is generally configured to have a multiplication circuit, an adder circuit and a register for accumulation. When such a dedicated DSP is used, the sum-of-products operation can be executed in one machine cycle, enabling a high-speed arithmetic/logic operation.

When the amount of data to be processed is very large, however, even a dedicated DSP is insufficient to attain dramatic improvement in performance. By way of example, when the data to be operated on amount to 10,000 sets and an operation on each data set can be executed in one machine cycle, at least 10,000 cycles are necessary to finish the operation. Therefore, though each process can be done at high speed in an arrangement in which the sum-of-products operation is done using a register file, when the amount of data increases, the time of processing increases in proportion thereto as the data are processed in series, and therefore, such an arrangement cannot achieve high speed processing.

When such a dedicated DSP is used, the processing performance depends heavily on operating frequency, and therefore, if high speed processing is given priority, power consumption would increase considerably.

In view of the foregoing, the applicant of the present invention has already proposed a configuration allowing arithmetic/logic operations on a large amount of data at high speed (Reference 1 (Japanese Patent Laying-Open No. 2006-127460)).

In the configuration described in Reference 1, a memory cell mat is divided into a plurality of entries, and an arithmetic logic unit (ALU) is arranged corresponding to each entry. Between the entries and the corresponding arithmetic logic units (ALUs), data are transferred in bit-serial manner, and operations are executed in parallel among a plurality of entries. For a binary operation, for example, data of two terms are read, operated on and the result of operation is stored. Such operation on data is executed on a bit-by-bit basis. Assuming that reading (load), operation and writing (store) of the operation result each require one machine cycle and the data word of the operation target has the bit width N, operation of each entry requires 4×N machine cycles. The data word of the operation target generally has a bit width of 8 to 64 bits. Therefore, when the number of entries is set relatively large, to 1024, and data of 8-bit width are to be processed in parallel, 1024 results of arithmetic operations can be obtained after 32 machine cycles. Thus, the necessary processing time can be reduced significantly as compared with sequential processing of 1024 sets of data.
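The cycle count above can be checked with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and the factor of four per bit is read here as two loads, one operation and one store, which is an interpretation of the binary-operation example rather than a statement taken from Reference 1. The sequential figure assumes one data set per machine cycle, as in the DSP example given earlier.

```python
def bit_serial_cycles(bit_width, cycles_per_bit=4):
    """Machine cycles for one bit-serial binary operation.

    All entries operate in parallel, so the count does not depend on
    the number of entries. cycles_per_bit=4 is assumed to cover two
    loads, one operation and one store for each bit.
    """
    return cycles_per_bit * bit_width


def sequential_cycles(num_data_sets, cycles_per_set=1):
    """Machine cycles when data sets are processed one after another."""
    return num_data_sets * cycles_per_set


entries = 1024
parallel = bit_serial_cycles(bit_width=8)
serial = sequential_cycles(entries)
print(f"bit-serial, {entries} entries in parallel: {parallel} cycles")
print(f"sequential, {entries} data sets: {serial} cycles")
```

Under these assumptions the parallel arrangement finishes in 32 cycles regardless of the number of entries, whereas the sequential count grows in proportion to the number of data sets.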

Further, in the configuration disclosed in Reference 1, data transfer circuits are provided corresponding to the entries. An inter-ALU connecting switch circuit (data transfer circuit: ECM (entry communicator)) is provided for data transfer between processors (ALUs), whereby data are transferred through dedicated buses among the entries. Therefore, as compared with a configuration in which data are transferred between entries through a system bus, arithmetic/logic operations can be executed with high-speed data transfer. Further, use of the inter-ALU connecting switch circuit achieves operations on data stored in various regions in the memory cell mat, whereby the degree of freedom in operation can be increased, and a semiconductor processing device performing various operations can be realized.

In the configuration described in Reference 1, it is possible to execute one same arithmetic/logic operation in parallel in processors among all entries of the memory mat. Specifically, the parallel processing device (MTX) described in Reference 1 is a processing device based on an SIMD (Single Instruction Stream Multiple Data Stream) architecture. Further, it uses the inter-ALU connecting switch circuit, so that communications between physically apart entries can be executed simultaneously in each entry, and processes over entries can also be executed.

In the configuration described in Reference 1, it is possible to execute a pointer register instruction for operating contents of a pointer register representing an access location in the memory cell mat, a 1-bit load/store instruction, a 2-bit load/store instruction, a 1-bit inter-entry data moving instruction, a 2-bit inter-entry data moving instruction for transferring data between a data storage portion of an entry and a corresponding operational processing element (ALU), a 1-bit arithmetic/logic operation instruction, and a 2-bit arithmetic/logic operation instruction. Further, by setting to “0” the value of a mask register (V register) provided in the processing element, the operation of the corresponding entry can be masked and the operation can be set to a non-execution state.

The processing device of Reference 1 is SIMD-based, and all entries execute one same arithmetic/logic operation in parallel. Therefore, when one same arithmetic/logic operation is to be executed on a plurality of data sets, high-speed operation becomes possible and, therefore, filtering of image data, for example, can be executed at high speed.

Arithmetic/logic operations with low degree of parallelism, however, must be executed one by one successively while operations other than the target operation are masked, or they must be processed by a host CPU. Such successive processing of arithmetic/logic operations with low degree of parallelism hinders increase in processing speed, and hence, the performance of the parallel processing device cannot be fully exhibited.

Further, in communication between entries, in a configuration of SIMD type architecture, all entries communicate in parallel with entries apart by the same distance (in accordance with the data moving instruction between entries). For each entry to communicate with an entry apart by an arbitrary distance, however, it is necessary to adjust the distance of data movement by combining the moving instruction between entries (data moving instruction) and the mask bit of the V register in the processing element. Therefore, parallel processing of data movement between entries at different distances is impossible.
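The restriction can be illustrated with a small model. The Python sketch below is an illustration under stated assumptions, not the circuit of Reference 1: it assumes that a positive distance moves data toward higher entry numbers, that movement wraps around at the ends, and that the mask bit selects which entries take part in a pass. The function name simd_masked_move is hypothetical. Under these assumptions, an SIMD device needs one masked pass per distinct distance.

```python
def simd_masked_move(data, distances):
    """Model entry-to-entry movement on an SIMD device.

    Each pass applies one common distance to all entries; the mask
    (standing in for the V register bit) enables only the entries that
    need that distance. Returns (moved data, number of passes).
    """
    n = len(data)
    result = [None] * n
    passes = 0
    for d in sorted(set(distances)):
        # One common moving instruction with distance d, masked per entry.
        mask = [1 if dist == d else 0 for dist in distances]
        for src in range(n):
            if mask[src]:
                # Wrap-around at the ends is an assumption of this model.
                result[(src + d) % n] = data[src]
        passes += 1
    return result, passes


moved, passes = simd_masked_move([10, 20, 30, 40], [1, 1, 1, -3])
print(moved, passes)
```

In this example two distinct distances are requested, so two sequential passes are required even though only four entries are involved.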

If the arithmetic/logic operation and/or data moving process of low degree of parallelism could be performed efficiently, the processor would have wider applications.

SUMMARY OF THE INVENTION

An object of the present invention is to provide a parallel processing device capable of efficiently performing processes such as arithmetic/logic operation and/or data moving process of low degree of parallelism.

According to a first aspect, the present invention provides a parallel processing device, including: a data storage unit having a plurality of data entries each having a bit width of a plurality of bits and arranged corresponding to each entry; and a plurality of arithmetic/logic processing elements arranged corresponding to the data entries of the data storage unit, of which content of an operational processing (arithmetic or logic operation) is set individually, for executing the set operation on applied data.

According to a second aspect, the present invention provides a parallel processing device, including: a data storage unit having a plurality of data entries each having a bit width of a plurality of bits and arranged corresponding to each entry; a plurality of arithmetic/logic processing elements arranged corresponding to the entries and each executing a set operational processing (arithmetic or logic operation) on applied data; and a plurality of data communication circuits provided corresponding to the plurality of entries and each performing data communication between the corresponding entry and another entry. The plurality of data communication circuits each have inter-entry (entry-to-entry) distance and direction of data movement set individually.

According to a third aspect, the present invention provides a parallel processing device, including: a data storage unit having a plurality of data entries each having a bit width of a plurality of bits and arranged corresponding to each entry; a plurality of arithmetic/logic processing elements arranged corresponding to the entries, having contents of an operational processing (arithmetic or logic operation) set individually, for executing the set operational processing such as arithmetic/logic operation on applied data; and a plurality of data communication circuits provided corresponding to the plurality of entries and each performing data communication between the corresponding entry and another entry. The plurality of data communication circuits each have entry-to-entry distance and direction of data movement set individually.

Further, contents of (arithmetic/logic) operation of the arithmetic/logic processing element of each entry and the amount and direction of data movement of the data communication circuit are set in registers for storing data to be processed and mask data for masking an operation, provided in the arithmetic/logic element.

The parallel processing device, in accordance with the first aspect of the present invention, is configured to set contents of operation in each arithmetic/logic processing element individually, and therefore, operations of low degree of parallelism can be executed concurrently in different entries, whereby performance can be improved. Particularly, data processing can be executed in a closed manner in the processing device, without the necessity of transferring data to the host CPU. Accordingly, the time required for data transfer can be reduced.

In the parallel processing device in accordance with the second aspect of the present invention, the amount of data movement is set in each entry and data can be moved between entries at a high speed. Accordingly, the time required for data transfer can be reduced.

In the parallel processing device in accordance with the third aspect of the present invention, contents of operation and data for setting the amount of data movement are stored in each operational processing register of the arithmetic or logic operation. Therefore, a dedicated register is unnecessary, and increase in layout area can be avoided. Further, the amount of data movement and contents of operation are set for each entry, so that high speed processing can be realized.

The foregoing and other objects, features, aspects and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 schematically shows an overall configuration of a processing system utilizing the processing device to which the present invention is applied.

FIG. 2 schematically shows a configuration of main processing circuitry shown in FIG. 1.

FIG. 3 shows a specific configuration of the memory cell mat shown in FIG. 2.

FIG. 4 schematically shows a configuration of an ALU processing element included in the ALU group shown in FIG. 3.

FIG. 5 shows, in a list, instructions for operating a pointer register of the main processing circuitry shown in FIG. 2.

FIGS. 6 and 7 show, in a list, ALU instructions of the main processing circuitry shown in FIG. 2.

FIGS. 8 and 9 show, in a list, entry-to-entry data moving instructions of the main processing circuitry shown in FIG. 2.

FIG. 10 schematically shows a configuration of an ALU processing element in accordance with Embodiment 1 of the present invention.

FIG. 11 shows, in a list, correspondence between bits of MIMD register and designated MIMD operation instructions, shown in FIG. 10.

FIG. 12 shows, in a list, logics of MIMD operation instructions shown in FIG. 11.

FIG. 13 schematically shows regions designated by the pointer in memory mat shown in FIG. 3.

FIG. 14 shows a structure of MIMD operation instruction.

FIG. 15 schematically shows an exemplary internal configuration of an adder shown in FIG. 10.

FIG. 16 schematically shows interconnection areas of the inter-ALU connecting switch circuit shown in FIG. 3.

FIG. 17 schematically shows interconnection arrangement of the 1-bit and 4-bit shift interconnection areas shown in FIG. 16.

FIG. 18 shows an exemplary arrangement of interconnection lines in the 16-bit shift interconnection area shown in FIG. 16.

FIG. 19 schematically shows interconnection arrangements of 64-bit and 256-bit shift interconnection areas shown in FIG. 16.

FIG. 20 schematically shows a configuration of an inter-ALU communication circuit shown in FIG. 10 and corresponding interconnections.

FIG. 21 shows an exemplary connection of interconnection lines to a reception buffer shown in FIG. 20.

FIG. 22 represents a 2-bit mode zigzag copy instruction.

FIG. 23 represents a 1-bit mode zigzag copy instruction.

FIG. 24 schematically shows data flow in the zigzag copy mode.

FIG. 25 shows, in a list, control bits, shift distances and shift directions of inter-ALU communication circuit shown in FIG. 10.

FIG. 26 shows an example of a zigzag copy operation.

FIG. 27 shows an exemplary configuration of a 4-bit adder.

FIG. 28 shows a configuration when a 4-bit adder shown in FIG. 27 is developed by a combination circuit.

FIG. 29 shows data arrangement of data entries at Stage 4 shown in FIG. 28.

FIG. 30 shows movement of bits as the operation target, at Stage 4 shown in FIG. 28.

FIG. 31 shows the flow of instruction bits when an operation instruction is determined, at Stage 4 shown in FIG. 28.

FIG. 32 shows bit arrangement as a result of arithmetic/logic operation at Stage 4 shown in FIG. 28.

FIG. 33 shows an example of a 2-bit counter.

FIG. 34 shows a configuration when the 2-bit counter shown in FIG. 33 is implemented by a sequential circuit of logic gates and flip-flops.

FIG. 35 shows a flow of data bits in one cycle of the 2-bit counter shown in FIG. 34.

FIG. 36 shows, in a list, number of cycles required for simultaneous moving operation of 16-bit data.

FIG. 37 shows data flow in a gather process in accordance with Embodiment 1 of the present invention.

FIG. 38 shows, in a list, the number of entries, the number of necessary cycles and the bit width of control storage region, for the gather process shown in FIG. 37.

FIG. 39 shows, in a graph, the number of entries and the number of cycles shown in the table of FIG. 38.

FIG. 40 shows, in a graph, the number of entries and the control bit width shown in the table of FIG. 38.

FIG. 41 shows data flow at the time of de-interleave process in the device in accordance with Embodiment 1 of the present invention.

FIG. 42 schematically shows data flow in de-interleave process utilizing a vertical movement instruction.

FIG. 43 shows, in a list, the number of entries, the number of cycles and the bit width of operation control memory regions in de-interleave process shown in FIGS. 41 and 42.

FIG. 44 is a graph showing the number of entries and the number of cycles shown in the table of FIG. 43.

FIG. 45 is a graph showing the number of entries and the bit width of operation control memory regions shown in the table of FIG. 43.

FIG. 46 shows a data flow in an anti-aliasing process.

FIG. 47 shows an exemplary data flow in an aliasing process, during the anti-aliasing process shown in FIG. 46.

FIG. 48 shows, in a list, the number of cycles and the bit width of operation control memory regions at the time of alias processing of 32-bit data.

FIG. 49 schematically shows a configuration of an ALU processing element in accordance with Embodiment 2 of the present invention.

FIG. 50 schematically shows a configuration of an ALU processing element in accordance with Embodiment 3 of the present invention.

FIG. 51 schematically shows a configuration of an ALU processing element in accordance with Embodiment 4 of the present invention.

FIG. 52 shows an exemplary configuration of an MIMD instruction decoder in accordance with Embodiment 5 of the present invention.

FIG. 53 shows another configuration of an MIMD instruction decoder in accordance with Embodiment 6 of the present invention.

FIG. 54 shows, in detail, the configuration of multiplexer shown in FIG. 53.

FIG. 55 schematically shows a configuration of an MIMD instruction decoder in accordance with Embodiment 7 of the present invention.

BEST MODES FOR CARRYING OUT THE INVENTION

Embodiment 1

FIG. 1 schematically shows an overall configuration of a processing system utilizing a semiconductor processing device in accordance with Embodiment 1 of the present invention. Referring to FIG. 1, the processing system includes a semiconductor processing device 1 executing parallel operations; a host CPU 2 performing process control on semiconductor processing device 1, control of the whole system and data processing; a memory 3 used as a main storage of the system and storing various necessary data; and a DMA (Direct Memory Access) circuit 4 directly accessing memory 3 without going through host CPU 2. By the control of DMA circuit 4, data can be transferred directly between memory 3 and semiconductor processing device 1, and semiconductor processing device 1 can be accessed directly.

Host CPU 2, memory 3, DMA circuit 4 and semiconductor processing device 1 are connected with each other through a system bus 5. Semiconductor processing device 1 includes a plurality of fundamental operation blocks (parallel processing devices) FB1 to FBn provided in parallel, an input/output circuit 10 transferring data/instruction with system bus 5, and a central control unit 15 controlling operational processing such as arithmetic and logic operations and data transfer in semiconductor processing device 1.

Fundamental operation blocks FB1 to FBn and input/output circuit 10 are coupled to an internal data bus 12. Central control unit 15, input/output circuit 10 and fundamental operation blocks FB1 to FBn are coupled to an internal bus 14. Between the fundamental operation blocks FB (generally representing blocks FB1 to FBn), an inter-block data bus 16 is provided. In FIG. 1, an inter-block data bus 16 arranged between neighboring fundamental operation blocks FB1 and FB2 is shown as a representative.

By providing fundamental operation blocks FB1 to FBn in parallel, the same or different arithmetic or logic operations are executed in semiconductor processing device 1. These fundamental operation blocks FB1 to FBn are of the same configuration, and therefore, the configuration of fundamental operation block FB1 is shown as a representative example in FIG. 1.

Fundamental operation block FB1 includes main processing circuitry 20 including a memory cell array (mat) and a processor; a micro-program storing memory 23 storing an execution program described in a micro code; a controller 21 controlling an internal operation of fundamental operation block FB1; a register group 22 including a register used as an address pointer; and a fuse circuit 24 for executing a fuse program for repairing any defect of main processing circuitry 20.

Controller 21 controls operations of corresponding fundamental operation blocks FB1 to FBn, as control is passed by a control instruction supplied from host CPU 2 through system bus 5 and input/output circuit 10. These fundamental operation blocks FB1 to FBn each contain micro program storing memory 23, and controller 21 stores an execution program in memory 23. Consequently, the contents of processing to be executed in each of fundamental operation blocks FB1 to FBn can be changed, and the contents of operations to be executed in each of fundamental operation blocks FB1 to FBn can be changed.

An inter-block data bus 16 allows high speed data transfer between fundamental operation blocks, by executing data transfer without using internal data bus 12. By way of example, while data is being transferred to a certain fundamental operation block through internal data bus 12, data can be transferred between different fundamental operation blocks.

Central control unit 15 includes: a control CPU 25; an instruction memory 26 storing an instruction to be executed by the control CPU; a group of registers 27 including a working register for control CPU 25 or a register for storing a pointer; and a micro program library storing memory 28 storing libraries of micro programs. Central control unit 15 has control passed from host CPU 2 through internal bus 14, and controls processing operations, including arithmetic and logic operations and transfer, of fundamental operation blocks FB1 to FBn through internal bus 14.

Micro programs having various sequential processes described in a code form are stored as libraries in micro program library storing memory 28. Central control unit 15 selects a necessary micro program from memory 28 and stores the program in micro program storing memory 23 of fundamental operation blocks FB1 to FBn. Thus, it becomes possible to address any change in the contents of processing by the fundamental operation blocks FB1 to FBn in a flexible manner.

By the use of fuse circuit 24, any defect in fundamental operation blocks FB1 to FBn can be repaired through redundancy replacement.

FIG. 2 schematically shows a configuration of a main portion of fundamental operation block FBi (FB1 to FBn) shown in FIG. 1. Referring to FIG. 2, in fundamental operation block FBi, main processing circuitry 20 includes a memory cell mat 30 in which memory cells are arranged in rows and columns, and a group of operational processing units (a group of ALUs (arithmetic and logic processing elements)) 32 performing an operational processing such as arithmetic or logic operations on data stored in memory cell mat 30. Memory cell mat 30 is divided into a plurality of data entries DERY. Data entry DERY includes data entries having numbers 0 to MAX_ENTRY allotted thereto. Each data entry has bit positions from 0 to MAX_BIT, and its bit width is MAX_BIT+1.

In the group of operational processing units (ALU group) 32, an operational processing unit (hereinafter also referred to as an arithmetic logic unit or ALU processing element) 34 is arranged corresponding to each data entry DERY. For the group of operational processing (arithmetic logic) units 32, a switch circuit 44 for interconnecting ALUs is provided.

In the following, an entry (ERY) is defined as encompassing the data entry DERY and the ALU processing element provided corresponding to the data entry.

The operation of main processing circuitry 20 is set by a program (micro program) stored in program storing memory 23. Controller 21 executes processing in accordance with the program stored in program storing memory 23.

In register group 22, pointer registers r0 to r3 are provided. Addresses of memory cell mat 30 of the data to be processed are stored in pointer registers r0 to r3. Controller 21 generates an address designating an entry (data entry) or a location in a data entry of main processing circuitry 20 in accordance with the pointers stored in pointer registers r0 to r3, to control data transfer (load/store) between memory cell mat 30 and the group of arithmetic logic units 32.

In the group of arithmetic logic units 32, contents of operation of ALU processing element are determined dependent on the operation mode, that is, determined commonly to all entries for an SIMD type operation and determined for each entry for an MIMD type operation. Further, inter-ALU connecting switch circuit 44 also includes an inter-ALU data transfer circuit arranged corresponding to each entry. At the time of entry-to-entry data transfer, the transfer destination can be set dependent on the operation mode, that is, commonly to all entries in the SIMD type operation and individually for each entry in the MIMD type operation.

When an SIMD type operation is executed and the same operation is to be executed among the entries, the contents of operation in the group of arithmetic logic units (ALUs) 32 and the connection path of inter-ALU connecting switch circuit 44 are commonly set by the control of controller 21. As to the connection path, controller 21 selectively controls setting of the path or route in accordance with an instruction stored in program storing memory 23, as indicated by dotted lines in FIG. 2 (in an MIMD type operation, the contents of operation and the transfer destination are set in each entry, in accordance with the stored data of the data entry; in an SIMD type operation, the contents of operation and the transfer destination are set by controller 21 commonly to the entries).
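The difference between the two modes can be sketched in a few lines of Python. This is a minimal model under stated assumptions: the operation names AND, OR and XOR and the functions simd_step and mimd_step are hypothetical and are not taken from the instruction lists of the figures. In the SIMD case one operation supplied by the controller is applied to every entry; in the MIMD case each entry carries its own operation selector, standing in for data stored in the data entry.

```python
import operator

# Hypothetical 1-bit operations; the actual instruction set is not modeled.
OPS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}


def simd_step(a_bits, b_bits, op_name):
    """Apply one common operation to the operand bits of every entry."""
    op = OPS[op_name]
    return [op(a, b) for a, b in zip(a_bits, b_bits)]


def mimd_step(a_bits, b_bits, op_names):
    """Apply an individually selected operation in each entry."""
    return [OPS[name](a, b) for a, b, name in zip(a_bits, b_bits, op_names)]


a = [1, 1, 0, 0]
b = [1, 0, 1, 0]
print(simd_step(a, b, "AND"))
print(mimd_step(a, b, ["AND", "OR", "XOR", "AND"]))
```

Both functions process all entries in a single step; only the source of the operation selector differs.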

FIG. 3 more specifically shows the configuration of main processing circuitry 20 shown in FIG. 2. Referring to FIG. 3, memory cell mat 30 is divided into two memory mats 30A and 30B. In memory mats 30A and 30B, memory cells MC are arranged in rows and columns. In FIG. 3, the memory cell MC has a configuration of a dual-port memory cell in which a write port and a read port are provided separately. The memory cell MC, however, may be a single port memory cell. Memory cell MC is an SRAM (Static Random Access Memory) cell.

In each of memory mats 30A and 30B, corresponding to memory cells MC arranged aligned in the row direction, a write word line WWL and a read word line RWL are provided. Corresponding to memory cells MC arranged aligned in the column direction, a write bit line pair WBLP and a read bit line pair RBLP are provided.

Each of the memory mats 30A and 30B has m data entries, that is, data entries DERY0 to DERY(m−1). Corresponding to a set of each write bit line pair WBLP and read bit line pair RBLP, a data entry is provided.

By write word line WWL and read word line RWL, memory cells at the same bit position of data entries DERY0 to DERY(m−1) are selected in parallel.

Between memory mats 30A and 30B, the group of arithmetic logic units 32 is provided. Though not explicitly shown in FIG. 3, for the group of arithmetic logic units 32, inter-ALU connecting switch circuit (44) is provided.

Between the group of arithmetic logic units 32 and memory mat 30A, a sense amplifier group 40A and a write driver group 42A are arranged, and between the group of arithmetic logic units 32 and memory mat 30B, a sense amplifier group 40B and a write driver group 42B are arranged.

Sense amplifier group 40A includes sense amplifiers SA arranged corresponding to read bit line pairs RBLP (RBLP0-RBLP(m−1)) of memory mat 30A, respectively. Write driver group 42A includes write drivers WB arranged corresponding to write bit line pairs WBLP (WBLP0-WBLP(m−1)) of memory mat 30A, respectively.

Similarly, sense amplifier group 40B includes sense amplifiers SA arranged corresponding to read bit line pairs RBLP (RBLP0-RBLP(m−1)) of memory mat 30B, respectively. Write driver group 42B includes write drivers WB arranged corresponding to write bit line pairs WBLP (WBLP0-WBLP(m−1)) of memory mat 30B, respectively. When single port memory cells are used, the write bit line pair WBLP and the read bit line pair RBLP are formed into a common bit line pair BLP, and the sense amplifier and the corresponding write driver are commonly coupled to the bit line pair BLP.

For memory mat 30A, a read row decoder 36 rA for selecting a read word line RWL, and a write row decoder 36 wA for selecting a write word line WWL are provided. For memory mat 30B, a read row decoder 36 rB for selecting a read word line RWL, and a write row decoder 36 wB for selecting a write word line WWL are provided.

An input/output circuit 49 is provided for sense amplifier group 40A and write driver group 42A, as well as write driver group 42B and sense amplifier group 40B, for data transfer with the internal data bus (bus 12 of FIG. 1).

Input/output circuit 49 receives and transfers in parallel the data transferred to memory mats 30A and 30B. The data stored in memory mats 30A and 30B may have bit positions re-arranged for each memory mat, or, alternatively, each of memory mats 30A and 30B may be provided with a register circuit for converting data arrangement, and data writing and reading may be performed word line by word line between the register circuit and the memory mat.

If the bit width of transfer data of input/output circuit 49 is smaller than the number of entries (data entries), an entry selecting circuit (column selecting circuit) for selecting a data entry is provided corresponding to the group of sense amplifiers and the group of write drivers, though not explicitly shown in FIG. 3. A configuration in which an appropriate number of data entries are selected in parallel in accordance with the bit width of the transfer data of input/output circuit 49 can be used for such entry selection. Alternatively, input/output circuit 49 may have a bit width converting function, and data transfer may be performed in parallel between input/output circuit 49 and data entries DERY0-DERY(m−1) and data transfer may be performed by the unit of bit width of internal data bus, between input/output circuit 49 and the internal data bus (bus 12 of FIG. 1).

In the configuration shown in FIG. 3, read row decoders 36 rA and 36 rB have the same configuration, and in accordance with the same address, drive the read word lines of the same bit position to the selected state. When the result of arithmetic and/or logic operation is to be stored in memory mat 30A, write row decoder 36 wA is activated, and the corresponding write word line is driven to the selected state. In this case, write row decoder 36 wB provided for memory mat 30B is maintained in an inactive state.

In the configuration of main processing circuitry shown in FIG. 3, two memory mats, that is, memory mats 30A and 30B are prepared, and between the memory mats 30A and 30B, a group of ALUs 32 is arranged. Therefore, by storing data sets as the operation target in each of memory mats 30A and 30B, it becomes possible to write data and read data in each machine cycle, whereby high speed arithmetic/logic operation (operational processing) is achieved.

When a single port memory is used, the write row decoder and the read row decoder are implemented by a common row decoder. In such a configuration, data load and store are executed in different machine cycles.

When an SIMD type operation is executed in main processing circuitry 20 shown in FIG. 3, one same arithmetic/logic operation is executed in every entry. The SIMD operation is executed in the following manner.

(i) Data bits DA[i] and DB[i] of the same bit position of data DA and DB as the operation target are read from memory mats 30A and 30B, and transferred to ALU processing element of the corresponding entry (loaded).

(ii) In each ALU processing element, a designated arithmetic/logic operation (operational processing) is executed on these data bits DA[i] and DB[i].

(iii) An operation result data bit C[i] is written (stored) at a bit position of a designated entry. In parallel with the writing operation, the data DA[i+1] and DB[i+1] of the next bit position are loaded to the ALU processing element.

(iv) The processes (i) to (iii) described above are repeated until all bits of the data of the operation target are operated and processed.
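Steps (i) to (iv) can be illustrated with a minimal Python sketch. It assumes that addition is the designated operation and models the entries with an ordinary loop rather than parallel hardware. The names `simd_bit_serial_add`, `to_bits` and `from_bits` are hypothetical, and the overlap of the store in step (iii) with the next load is not modeled.

```python
def to_bits(value, width):
    """Return the bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def simd_bit_serial_add(mat_a, mat_b, width):
    """Add operands entry by entry, one bit position per step.

    mat_a[j] and mat_b[j] hold the bits of entry j (index 0 = LSB).
    The per-entry carry list plays the role of the C register.
    """
    entries = len(mat_a)
    carry = [0] * entries
    result = [[0] * width for _ in range(entries)]
    for i in range(width):                 # step (iv): repeat for all bits
        for j in range(entries):           # conceptually parallel in hardware
            a = mat_a[j][i]                # step (i): load DA[i]
            b = mat_b[j][i]                # step (i): load DB[i]
            s = a ^ b ^ carry[j]           # step (ii): same operation everywhere
            carry[j] = (a & b) | (carry[j] & (a ^ b))
            result[j][i] = s               # step (iii): store C[i]
    return result
```

The number of loop iterations over `i` depends only on the bit width, not on the number of entries, which is the property the bit-serial, entry-parallel arrangement relies on.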

An execution sequence of an MIMD type operation will be described in detail later. An operation of 2-bit basis may also be executed (both in the SIMD type operation and MIMD type operation) and, in that case, two data entries DERY constitute one entry ERY.

FIG. 4 schematically shows a configuration of an ALU processing element34 of a unit element included in the group of ALUs 32. In ALU processingelement 34, bit by bit and 2-bits by 2-bits arithmetic/logic operations(operational processing) are possible. In memory mats 30A and 30B, dataentries DERYA and DERYB each consist of even-numbered data entry DERYestoring data bits A[2i] of an even-numbered address and an odd-numbereddata entry DERYo storing data bits A[2i+1] of an odd-numbered address.Arithmetic/logic operation (operational processing) is performed inparallel on data bits of the same address (bit location) ineven-numbered data entry DERYe and odd-numbered data entry DERYo, andthe process is executed at high speed.

The even-numbered data entry DERYe and odd-numbered data entry DERYo of data entry DERYA are respectively coupled to internal data lines 65 a and 66 a. The even-numbered data entry DERYe and odd-numbered data entry DERYo of data entry DERYB are coupled to internal data lines 65 b and 66 b, respectively.

ALU processing element 34 includes, as processing circuits for performing arithmetic/logic operations, cascaded full adders 50 and 51. In order to set process data and contents of operation in ALU processing element 34, an X register 52, a C register 53, an F register 54, a V register 55 and an N register 56 are provided. X register 52 is used for storing operation data and for transferring data to another ALU processing element. C register 53 stores a carry in an addition operation. F register 54 selectively inverts an operation bit in accordance with a value stored therein, to realize a subtraction.

V register 55 stores a mask bit V for masking an arithmetic/logic operation (including data transfer) in ALU processing element 34. Specifically, when the mask bit V is set to “1”, ALU processing element 34 executes the designated arithmetic/logic operation (operational processing), and when the mask bit V is set to “0”, the arithmetic/logic operation is inhibited. Thus, the arithmetic/logic operation is selectively executed in the unit of ALU processing element.

ALU processing element 34 further includes an XH register 57 and an XLregister 58 for storing 2-bit data in parallel, a selector (SEL) 60selecting 2 bits of one of the data sets from registers 52, 57 and 58 inaccordance with a value stored in D register 59, a selection inversioncircuit 61 performing an inversion/non-inversion operation on 2 bitsselected by selector 60 in accordance with a bit stored in F register54, and gates 62 and 63 selectively outputting a sum output S of fulladders 50 and 51 in accordance with data stored in registers 55 and 56.

The outputted 2 bits of selection inversion circuit 61 are applied to Ainputs of full adders 50 and 51, respectively. X register 52 isconnected either to internal data line 65 a or 65 b by a switch circuitSWa, and connected either to internal data line 66 a or 66 b by a switchcircuit SWb. By the switch circuits SWa and SWb, in a 1-bit operation,data of one of memory mats 30A and 30B is stored in the X register, andin data transfer, the transfer data is stored in the X register.

XH register 57 is connectable to one of internal data lines 65 a and 65b through a switch circuit SWc, and connectable to one of internal datalines 66 a and 66 b through a switch SWm. XL register 58 is connectableeither to internal data line 66 a or 66 b through a switch circuit SWd.

The B input of full adder 50 is connected either to internal data line65 a or 65 b by a switch circuit SWe. Gate 62 is connected either tointernal data line 65 a or 65 b by a switch circuit SWf. The B input offull adder 51 is connectable to any of internal data lines 65 a, 65 b,66 a and 66 b by switch circuits SWg and SWh.

Gate 63 is connectable either to internal data line 65 a or 65 b by aswitch circuit SWj, and connectable either to internal data line 66 a or66 b by a switch circuit SWk.

By these switch circuits SWa-SWh, SWj, SWk and SWm, serial processing of1-bit unit in performing 2-bit parallel division is realized, and datatransfer of 2-bit unit and data transfer of 1-bit unit are realized indata transfer.

When ALU processing element 34 performs a 1-bit operation, that is, whenit performs an operation in 1-bit serial manner, a carry input Cin offull adder 51 is coupled by a switch 67 to C register 53. Gates 62 and63 execute a designated arithmetic/logic operation when values stored inV register 55 and N register 56 are both “1”, and otherwise, gates 62and 63 are both set to an output high impedance state.

The value stored in C register 53 is connected to carry input Cin offull adder 50 through switch circuit 67. When an arithmetic/logicoperation of 1-bit unit, or bit by bit basis operation, is executed,switch circuit 67 isolates the carry output Co of full adder 50, andconnects the carry input Cin of full adder 51 to C register 53 (at thistime, an addition is executed in full adder 51).

In ALU processing element 34 shown in FIG. 4, using X register 52 and XH register 57, or XH register 57 and XL register 58, data can be transferred 2-bits by 2-bits between another entry and the corresponding entry.

For controlling such data transfer, in inter-ALU connecting switch circuit 44, in correspondence to an entry, a movement data register (reconfigurable entry communication register: RECM register) 70, and an inter-ALU communication circuit (reconfigurable entry communicator: RECM) 71 for setting a data transfer path in accordance with data bits E0-E3 stored in the movement data register 70 are provided.

In ALU processing element 34, in order to set contents of operationindividually entry by entry, an MIMD instruction register 72 and an MIMDinstruction decoder 74 decoding bit values M0 and M1 stored in the MIMDinstruction register to set contents of operation of full adder 50 andto generate a control signal for realizing a combination logic, areprovided. By bits M0 and M1 of MIMD instruction register 72, it becomespossible to realize different arithmetic/logic operation in each entry,whereby an MIMD (Multiple Instruction stream-Multiple Data stream) typeoperation is realized. Prior to description of the MIMD operation anddata transfer of ALU processing element 34, a group of instructionsprepared at the time of SIMD operation will be described.

As pointer registers designating addresses of the memory mat, pointerregisters p0 to p3 are used. Further, as shown in FIG. 2, pointerregisters r0 to r3 in general register are also utilized. Pointerregisters p0 to p3 are included in the group of registers shown in FIG.2.

FIG. 5 shows, in the form of a list, pointer register instructionsrelated to operations on pointer registers p0 to p3.

The instruction “ptr. set n, px” is for setting an arbitrary value n ina pointer register px. The arbitrary value n may assume any value withinthe range of the bit width (0 to MAX_BIT) of one data entry. The value xis any of 0 to 3.

The instruction “ptr. cpy px, py” is a copy instruction for transferringand storing the content of pointer register px to pointer register py.

The instruction “ptr. inc px” is for incrementing by one the pointer ofpointer register px.

The instruction “ptr. inc2 px” is for incrementing by two the pointer of pointer register px.

The instruction “ptr. dec px” is for decrementing by one the pointer ofpointer register px.

The instruction “ptr. dec2 px” is for decrementing by two the pointer ofpointer register px.

The instruction “ptr. sft px” is for left-shifting by one bit thepointer of pointer register px.

By utilizing instructions “ptr. inc2 px” and “ptr. dec2 px”, 2-bitparallel processing becomes possible (odd-numbered and even-numberedaddresses are simultaneously updated). In the 2-bit operation, thoughthe pointer is incremented/decremented 2-bits by 2-bits, the position ofselected word line in the memory mat changes 1 row address at a time.
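The pointer register instructions of FIG. 5 can be summarized as a small Python model. This is only a sketch: the class name `PointerRegisters` and the helper `row_for_2bit` are hypothetical, overflow behavior is not described in the text and is left unchecked here, and mapping a pointer to a word line row by halving it in 2-bit operation is an assumption drawn from the even/odd data entry arrangement.

```python
class PointerRegisters:
    """Toy model of pointer registers p0 to p3 (hypothetical class name)."""

    def __init__(self, max_bit):
        self.max_bit = max_bit
        self.p = [0, 0, 0, 0]

    def set(self, n, x):            # ptr. set n, px
        if not 0 <= n <= self.max_bit:
            raise ValueError("n must lie within 0..MAX_BIT")
        self.p[x] = n

    def cpy(self, x, y):            # ptr. cpy px, py
        self.p[y] = self.p[x]

    def inc(self, x):               # ptr. inc px
        self.p[x] += 1

    def inc2(self, x):              # ptr. inc2 px
        self.p[x] += 2

    def dec(self, x):               # ptr. dec px
        self.p[x] -= 1

    def dec2(self, x):              # ptr. dec2 px
        self.p[x] -= 2

    def sft(self, x):               # ptr. sft px (left shift by one bit)
        self.p[x] <<= 1


def row_for_2bit(pointer):
    """Assumed row address in 2-bit operation: bits 2i and 2i+1 share row i."""
    return pointer // 2
```

Under this assumption, stepping the pointer by two with `inc2` advances the selected row by exactly one, which matches the remark that the word line position changes one row address at a time.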

FIG. 6 shows, in the form of a list, load/store instructions of 1-bitoperation of the ALU processing element.

Referring to FIG. 6, the instruction “mem. ld. #R@px” is for storing (loading) the data bit at a position Aj[px] designated by the pointer register px to register #R. Register #R is any of the X register, N register, V register, F register, D register, XL register, XH register and C register. At the time of 1-bit ALU operation, the X register is used, and the XL register and XH register are not used.

The instruction “mem. st. #R@px” is for writing (storing) the valuestored in register #R to the memory cell position Aj[px] designated bythe pointer register px.

The store instruction is not executed when the mask register (V register55) is cleared.

In the store instruction also, the register #R is any of the X register,N register, V register, F register, D register, XL register, XH registerand C register.

The instruction “mem. swp. X@px” is for swapping the value stored in theX register 52 and the data at the memory cell position Aj[px] designatedby the pointer register px. The swap instruction is executed when “1” isset both in the mask register (V register 55) and N register 56. As theX register 52 is cleared/set by the data stored in the memory cell,circuit configuration can be simplified.

FIG. 7 shows, in the form of a list, load/store instructions for the ALUunit in 2-bit operation.

Referring to FIG. 7, the instruction “mem. 2. ld. X@px” is for storingthe data of memory cell positions Aj [px] and Aj[px+1] designated by thepointer register px to XL register 58 and XH register 57, respectively.Specifically, a lower bit of data at successive address positions isstored in the XL register 58 and a higher bit is stored in the XHregister 57.

The instruction “mem. 2. st. X@px” is for storing values stored in theXL register and the XH register, respectively, to the memory cells ofsuccessive addresses Aj[px] and Aj[px+1] designated by the pointerregister px. This operation is not executed when the mask register (Vregister) 55 is cleared.

The instruction “mem. 2. swp. X@px” is for swapping the data at theaddress Aj[px] designated by the pointer register px and a higheraddress Aj[px+1] with the values stored in the XL register 58 and XHregister 57, respectively. The swap instruction is not executed when theV register 55 and the N register 56 are both cleared.

In the 2-bit operation, successive addresses Aj[px] and Aj[px+1] are accessed simultaneously using the pointer of pointer register px, whereby parallel processing of 2 bits is achieved. By utilizing this 2-bit operation, data storage to movement data register 70 and MIMD instruction register 72 can also be executed.
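The 2-bit load and store of FIG. 7 can be sketched for a single entry as follows. The class name `Entry2Bit` and its attribute names are hypothetical, and the swap instruction is omitted because its exact mask condition is not modeled here.

```python
class Entry2Bit:
    """One entry with a bit memory and the XL, XH and V registers (toy model)."""

    def __init__(self, bits):
        self.mem = list(bits)   # mem[k] stands for the bit at address Aj[k]
        self.xl = 0             # XL register: lower bit
        self.xh = 0             # XH register: higher bit
        self.v = 1              # V register: mask bit

    def mem2_ld(self, px):
        """mem. 2. ld. X@px: load Aj[px] into XL and Aj[px+1] into XH."""
        self.xl = self.mem[px]
        self.xh = self.mem[px + 1]

    def mem2_st(self, px):
        """mem. 2. st. X@px: store XL and XH, skipped when V is cleared."""
        if self.v == 0:
            return
        self.mem[px] = self.xl
        self.mem[px + 1] = self.xh
```

For example, loading from address 2 and storing to address 0 copies a 2-bit field in two instructions, and clearing `v` turns the store into a no-op for that entry only.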

In the 2-bit operation instruction, the XL and XH registers are used. It is also possible, however, to use the XL and XH registers in an SIMD operation and to use the X and XH registers for an MIMD operation instruction. Further, the X register and the XH register may be used both for the SIMD type and MIMD type operations.

FIG. 8 shows, in the form of a list, instructions for moving data (move:vcopy) between entries, in 1-bit operation. When data is moved betweenentries, the pointer register rn is used. Candidates of the pointerregister rn for movement data between entries include four pointerregisters r0 to r3.

The instruction “ecm. mv. n #n” is for transferring the value stored inthe X register of an entry j+n distant by a constant n to the X registerof entry j.

The instruction “ecm. mv. r rn” represents an operation in which thevalue of X register of entry j+rn distant by a value stored in theregister rn is transferred to the X register of entry j.

The instruction “ecm. swp” instructs an operation of swapping the valuesstored in the X registers Xj and Xj+1 of adjacent entries j+1 and j.

The moving of data between entries shown in FIG. 8 is commonly executedin each entry pair.
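The entry movement instructions of FIG. 8 can be sketched as list operations on the X register values of all entries. Two points are assumptions of this sketch rather than statements of the text: entries are treated as wrapping around cyclically at the ends, and “ecm. swp” is taken to act on the pairs (0, 1), (2, 3) and so on. The function names are hypothetical, and integer labels are used instead of single bits so the movement is easy to follow.

```python
def ecm_mv(x, n):
    """ecm. mv. n #n: entry j receives the X value of entry j+n.

    The same distance n applies to every entry (SIMD behavior).
    Cyclic wrap-around at the ends is an assumption of this sketch.
    """
    m = len(x)
    return [x[(j + n) % m] for j in range(m)]


def ecm_swp(x):
    """ecm. swp: swap the X values of adjacent entries j and j+1.

    Pairing entries as (0, 1), (2, 3), ... is an assumption.
    """
    out = list(x)
    for j in range(0, len(out) - 1, 2):
        out[j], out[j + 1] = out[j + 1], out[j]
    return out
```

The register form “ecm. mv. r rn” would behave like `ecm_mv` with `n` taken from pointer register rn.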

FIG. 9 shows, in the form of a list, operations of moving (move) data between entries in the ALU for 2-bit operation. In the 2-bit operation, instruction descriptor “ecm2” is used in place of instruction descriptor “ecm”. By the designation of instruction descriptor “ecm2”, arithmetic/logic operation 2-bits by 2-bits is defined, and parallel data transfer with XH and XL registers (or with X and XH registers) is performed. For the designation of contents to be transferred with the registers, the same instruction descriptors as the 1-bit operation, that is, “mv. n #n”, “mv. r rn” and “swp” are used.

Therefore, when an SIMD type operation is executed, at the time of data transfer, the XH and XL registers may be used or the X and XH registers may be used, as data registers. In the 2-bit unit movement operation also, the amount of data transfer for each entry is the same.

Further, as arithmetic and logic operation (operational processing)instructions, addition instruction “alu.adc@px”, subtraction instruction“alu.sbc@px”, inversion instruction “alu.inv@px” and a register valuesetting instruction using a function value, that is, “alu.let f” areprepared.

By the addition instruction “alu.adc@px”, the data at the memory addressindicated by the pointer of pointer register px is added to the value inthe X register, and the result is returned to the memory mat. In thememory cell address Aj, the value after addition is stored, and a carryis stored in the C register.

By the subtraction instruction “alu.sbc@px”, from the data at the memoryaddress indicated by the pointer register px, the value stored in the Xregister is subtracted, and the result is returned to the memory mat.The value as a result of subtraction is stored in the memory cell at Aj,and the carry is stored in the C register.
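The way addition and subtraction share one full adder can be sketched in Python for a single entry. This is an illustration under stated assumptions, not the circuit itself: subtraction is assumed to use two's complement, with the F register inverting each X bit and the carry preset to 1 before the first bit. The function name `bit_serial_add_sub` and the helper names are hypothetical.

```python
def to_bits(value, width):
    """Return the bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add_sub(mem_bits, x_bits, subtract=False):
    """Return (result bits, final carry) for mem + X or mem - X.

    f models the F register (1 inverts each X bit); c models the C register.
    Presetting c to 1 for subtraction is an assumption (two's complement).
    """
    f = 1 if subtract else 0
    c = 1 if subtract else 0
    out = []
    for m, x in zip(mem_bits, x_bits):   # one bit position per step
        a = x ^ f                        # selective inversion of the X bit
        out.append(a ^ m ^ c)            # sum bit written back to memory
        c = (a & m) | (c & (a ^ m))      # carry kept for the next bit
    return out, c
```

With 4-bit operands, `bit_serial_add_sub(to_bits(13, 4), to_bits(5, 4), subtract=True)` yields the bits of 8, since 13 + (15 − 5) + 1 equals 24, which is 8 modulo 16.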

By the inversion instruction “alu.inv@px”, the data at the memoryaddress indicated by the pointer of pointer register px is inverted andreturned to the memory mat (to the original position).

By the function value instruction “alu.let f”, values of F register, D register, N register and C register respectively are set by the corresponding bit values, in accordance with a function value represented by function f=(F·8+D·4+N·2+C), with the symbol “·” indicating the multiplication.
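The function value can be read as a 4-bit field whose weights 8, 4, 2 and 1 select the F, D, N and C bits. A minimal sketch of that decoding, with the hypothetical function name `alu_let`:

```python
def alu_let(f):
    """Split function value f = F*8 + D*4 + N*2 + C into register bits."""
    if not 0 <= f <= 15:
        raise ValueError("f must fit in 4 bits")
    return {
        "F": (f >> 3) & 1,
        "D": (f >> 2) & 1,
        "N": (f >> 1) & 1,
        "C": f & 1,
    }
```

For instance, f = 10 = 1·8 + 0·4 + 1·2 + 0 sets the F and N bits and clears the D and C bits.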

Further, as 2-bit operation instruction, a booth instruction“alu2.booth” and an execution instruction “alu2.exe@px” are prepared.

The booth instruction “alu2.booth” is for performing multiplication inaccordance with the second order Booth algorithm, and from the values ofXH, XL and F registers, the values of N, D and F registers for the nextoperation are determined. Further, the execution instruction“alu2.exe@px” is an operation instruction that makes a conditionalbranch in accordance with values of D and F registers.

By utilizing these instructions, it becomes possible to execute anoperation or data transfer in each entry in accordance with the sameoperation instruction. Execution of instruction is controlled bycontroller 21 shown in FIG. 1.

Now, an MIMD type operation using data moving register (RECM register)70 and MIMD instruction register 72 shown in FIG. 4 above will bedescribed.

When an MIMD type logic operation is executed, an instruction“alu.op.mimd” is used. In the MIMD type operation, only logic operationinstructions are prepared as executable instructions. Specifically, fourinstructions, that is, AND instruction, OR instruction, XOR instructionand NOT instruction are prepared. The minimum necessary number of bitsfor selecting an execution instruction from these four instructions is 2bits. Therefore, in MIMD instruction register 72, 2-bit data M0 and M1are stored. When the contents of the MIMD type operation are added, thenumber of instruction bits is set in accordance with the number ofexecutable MIMD operations.

FIG. 10 schematically shows internal connections of the ALU processing element when an MIMD type operation, that is, an instruction of MIMD type is executed. In the following, the internal configuration of ALU processing element in execution of an MIMD type instruction will be described.

When an MIMD type instruction is executed, X register 52 and XH register57 are used as registers for performing 2-bit operation. When the MIMDtype instruction is executed, XL register 58 is not used. Therefore,switch circuit SWa connects internal data line 65 a to X register 52,and switch circuit SWm couples internal data line 66 a to XH register57. Switch circuit SWe couples internal data line 65 b to the B input ofadder 50, and switch circuit SWf couples an output of gate 62 tointernal data line 65 b. Switch circuit SWh connects internal data line66 b to the B input of adder 51, and switch circuit SWk connects anoutput of gate 63 to internal data line 66 b.

By MIMD instruction decoder 74, adder 50 executes any of ANDinstruction, OR instruction, XOR instruction and NOT instruction, asdescribed above. The result of logic operation is stored in data entryDERYB of memory mat 30B. When not one logic operation alone is done butthe same logic operations are executed in parallel by adders 50 and 51,a control signal outputted from MIMD decoder 74 is commonly applied toadders 50 and 51. Here, as an example, logic operation is executedindividually in each entry, using adder 50.

Further, inter-ALU communication circuit (RECM) 71 couples X register 52 and XH register 57 to internal data lines in accordance with bit values E0-E3 stored in movement data register (RECM register) 70, and transfers data to a transfer destination designated by the data bits E0-E3.

In ALU processing element 34 shown in FIG. 10, in accordance with the control signal from MIMD instruction decoder 74, the content of internal operation of adder 50 is set, and a designated logic operation is executed in each ALU processing element, and by inter-ALU communication circuit 71, data movement can be executed with the amount of data movement and transfer direction set individually in each entry.

FIG. 11 shows, in a list, correspondence between the data bits (MIMD instruction bits) M0 and M1 stored in MIMD instruction register 72 and the operations executed by adder 50. Referring to FIG. 11, when bits M0 and M1 are both “0”, negation operation “NOT” is designated. When bits M0 and M1 are “0” and “1”, respectively, a logical sum operation “OR” is designated. When bits M0 and M1 are “1” and “0”, respectively, an exclusive logical sum operation “XOR” is designated. When bits M0 and M1 are both “1”, a logical product operation “AND” is designated.

Therefore, in the present invention, four logic operations are preparedand by 2-bit MIMD instruction M0 and M1, the content of operation isdesignated. When the number of operation contents to be designatedincreases, the number of data bits stored in MIMD instruction register72 is also increased.

FIG. 12 shows, in a list, the MIMD operation instructions and thecontents executed correspondingly.

Referring to FIG. 12, M0 j and M1 j represent MIMD instruction bits inan ALU processing element ALUj, and Aj represents a result of operationin the processing element ALUj. Here, j indicates an entry number, andits range is the entry number 0 to MAX_ENTRY.

The operation instruction is executed when the mask bit Vj is “1”. Here,“!” represents a negation operation (inversion). Therefore, when bits M0j and M1 j are both “0” and mask bit Vj is “1”, the negation operationinstruction “alu.op.not” is executed. Here, in entry j, an invertedvalue !Aj[px] of bit Aj[px] designated by pointer px is obtained as theoperation result data bit Aj.

For a logical sum operation instruction “alu.op.or”, bit M0 j is set to “0” and bit M1 j is set to “1”. When the instruction is executed, mask bit Vj is “1”. By the logical sum operation, logical sum of the data bit Aj[px] designated by pointer px and the data bit Xj stored in the X register is obtained.

For an exclusive logical sum operation “alu.op.xor”, bit M0 j is set to “1”, and bit M1 j is set to “0”. When the instruction is executed, mask bit Vj is “1”. By the exclusive logical sum operation, exclusive logical sum of the data bit Aj[px] designated by pointer px and the data bit Xj stored in the X register is obtained.

For a logical product instruction “alu.op.and”, bits M0 j and M1 j areboth set to “1”. Mask bit V is “1”. Here, a logical product of the databit Aj[px] designated by pointer px and the data bit Xj stored in the Xregister is obtained.
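The correspondence of FIGS. 11 and 12 amounts to a four-entry lookup followed by a masked 1-bit operation. The sketch below uses the hypothetical names `MIMD_OPS` and `mimd_bit_op`; leaving Aj unchanged when the mask bit is “0” is an assumption about how an inhibited operation appears to software.

```python
# (M0, M1) -> operation name, as listed for FIG. 11
MIMD_OPS = {
    (0, 0): "not",
    (0, 1): "or",
    (1, 0): "xor",
    (1, 1): "and",
}


def mimd_bit_op(m0, m1, v, a, x):
    """Return the result bit Aj for one entry.

    a is the memory bit Aj[px], x is the X register bit Xj, v is mask bit Vj.
    """
    if v != 1:
        return a                  # assumed: masked entry keeps its old bit
    op = MIMD_OPS[(m0, m1)]
    if op == "not":
        return a ^ 1              # !Aj[px]
    if op == "or":
        return a | x
    if op == "xor":
        return a ^ x
    return a & x                  # "and"
```

Because `m0`, `m1` and `v` are per-entry values, two neighboring entries can apply different operations to their own bits in the same step.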

FIG. 13 schematically shows data bit storage regions in one data entry. The data entry DERY is divided into at least three regions RGa, RGb and RGc. The region RGa has its least significant address bit (start address) designated by pointer ap and has bit width of n bits. The region RGb has its start address bs designated by pointer bp and has bit width of n bits from the start address bs. Region RGc is for storing mask data and MIMD operation instruction data. The bit width of this region depends on the hardware (H/W), that is, the number of executable instructions. In the memory mat, the bit width of region RGc is determined in accordance with the actual contents of operation to be executed, data bit width of the operation target and the number of data. The start address is set by a pointer cs.

Further, a temporary region for storing work data is also used.Configuration of data regions will be described later, together withspecific operation procedures.

FIG. 14 shows a form of an instruction when an MIMD type operation is executed, in which an operation instruction is executed individually in each entry. The MIMD instruction is denoted by a code “mx_mimd”. The MIMD operation “mx_mimd” is executed by controller 21 shown in FIG. 1. A prototype of the MIMD operation is represented by “void mx_mimd (int ap, int bp, int cp, int n)”. The argument ap is a destination address, bp a source address, and cp represents an MIMD instruction storage address. Further, n represents bit length of each region. Specifically, by int ap, the start address “as” of region RGa shown in FIG. 13 is set, by int bp, the start address “bs” of region RGb shown in FIG. 13 is designated, and by int cp, the start address “cs” of region RGc shown in FIG. 13 is set. Here, n represents bit width of regions RGa and RGb. In the prototype shown in FIG. 14, the bit width of each of regions RGa and RGb is set to n bits, and the bit width of region RGc is set to log2 of the number of executable instructions.

When the MIMD operation shown in FIG. 14 is executed, the following process steps are executed.

Step 1: An mx_mimd instruction is executed by the controller. In accordance with a load instruction ld, MIMD operation instruction M0, M1 at the bit position (address) designated by pointer cp is copied to MIMD instruction register 72 shown in FIG. 10. Thus, the operation content “alu. op. mimd” to be executed by the entry unit is set. Here, “mimd” is any of “or”, “xor”, “and” and “not”.

Step 2: Content of the region at the bit position (address) designated by pointer ap and the content of the region at the bit position (address) designated by pointer bp are read bit by bit, and transferred to the ALU processing element (loaded).

Step 3: On the loaded data bits, the logic operation designated by the data stored in MIMD instruction register 72 is performed. The MIMD operation instruction is executed only when the mask bit (V register 55) is set to 1 in the ALU processing element.

Step 4: The result of operation is stored in a bit position (address) designated by pointer ap of region RGa shown in FIG. 13, having the start address of as.

Step 5: The process of steps 2 to 4 is repeatedly executed on all the data bits as the target of operation. Though each operation is done in bit-serial manner, the process is executed in parallel among a plurality of entries, and taking advantage of the high speed operability of SIMD type operation, operations of less parallelism can be executed concurrently with each other, whereby high speed processing is realized.

When the MIMD operation is executed, pointers ap, bp and cp are applied commonly to the entries of the memory mat, and in each entry, an operation (logic operation) designated by the MIMD operation instruction “alu. op. mimd” is executed individually, in bit-serial manner.
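Steps 1 to 5 can be put together in a short Python simulation of “mx_mimd”. Several details are assumptions of this sketch and are not fixed by the description: each entry is a flat list of bits, M0 is stored at address cp and M1 at cp+1, the mask bits are passed in as a separate list instead of being read from region RGc, and NOT inverts the bit read from the region at ap. The helper names are hypothetical.

```python
def to_bits(value, width):
    """Return the bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


# (M0, M1) -> 1-bit operation on (a, b); a comes from ap, b from bp
OPS = {
    (0, 0): lambda a, b: a ^ 1,   # NOT
    (0, 1): lambda a, b: a | b,   # OR
    (1, 0): lambda a, b: a ^ b,   # XOR
    (1, 1): lambda a, b: a & b,   # AND
}


def mx_mimd(entries, masks, ap, bp, cp, n):
    """Apply each entry's own logic operation to its n-bit regions in place."""
    for mem, v in zip(entries, masks):        # parallel across entries in hardware
        op = OPS[(mem[cp], mem[cp + 1])]      # step 1: load M0, M1
        if v != 1:
            continue                          # step 3: masked entries do nothing
        for i in range(n):                    # step 5: repeat for every bit
            a = mem[ap + i]                   # step 2: load bit from region at ap
            b = mem[bp + i]                   # step 2: load bit from region at bp
            mem[ap + i] = op(a, b)            # steps 3 and 4: operate and store
```

The pointers `ap`, `bp` and `cp` are shared by all entries, while the instruction bits and mask differ per entry, so the number of bit steps depends only on `n`.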

FIG. 15 shows an exemplary configuration of adder 50 shown in FIG. 10.Referring to FIG. 15, adder 50 includes an XOR gate 81 receiving databits applied to inputs A and B, an AND gate 82 receiving bits at inputsA and B, an inverter 80 inverting the bit applied to input A, an XORgate 83 receiving a bit from a carry input Ci and an output bit of XORgate 81, an AND gate 84 receiving an output bit of XOR gate 81 and a bitfrom carry input Ci, and an OR gate 85 receiving output bits of ANDgates 82 and 84 and generating a carry output Co. A sum output S isapplied from XOR gate 83.

In adder 50, further, in order to switch internal path in accordance with the MIMD control data, switch circuits 87 a to 87 g are provided. Switch circuit 87 a couples the output signal of inverter 80 to sum output S in accordance with an inversion instruction signal φnot. Switch circuit 87 b couples the output of AND gate 82 to sum output S in accordance with a logical product instruction signal φand. Switch circuit 87 c couples the output of XOR gate 81 to sum output S in accordance with an exclusive logical sum instruction signal φxor. Switch circuit 87 e couples the output of XOR gate 81 to the first input of OR gate 85 in accordance with a logical sum instruction signal φor. Switch circuit 87 f couples the output of OR gate 85 to the sum output S in accordance with a logical sum instruction signal φor. Switch circuit 87 d selectively couples the output of AND gate 84 to the first input of OR gate 85, in accordance with an inversion signal /φor of the logical sum instruction signal.

Switch circuit 87 g couples the output of XOR gate 83 to sum output S in accordance with an inversion signal /φmimd of the MIMD operation instruction signal.

The inversion signal /φmimd of the MIMD instruction signal is set to an inactive state when an MIMD operation is done, and sets switch circuit 87 g to an output high impedance state. Similarly, switch circuit 87 d attains to the output high impedance state in accordance with the inversion signal /φor of the logical sum instruction signal, when a logical sum operation is executed.

The adder 50 shown in FIG. 15 is a full adder having a generally usedcircuit configuration. Though an inverter 80 is additionally providedfor performing the negation operation, it may be provided to select theoutput of selection inversion circuit 61 shown in FIG. 10. In that case,bit value of F register (see FIG. 10) is set such that an inversionoperation is performed.

Alternatively, an inverter may be provided in XOR gate 81 and theinverter in XOR gate 81 may be used as an inverter for executing the NOToperation.

In the configuration of adder 50 shown in FIG. 15, when a negationoperation NOT is to be executed, switch circuit 87 a is renderedconductive, and other switch circuits are all rendered non-conductive,whereby the output signal of inverter 80 is transmitted to sum output S.

When a logical product operation AND is to be executed, logical productinstruction signal φand is activated, switch circuit 87 b is renderedconductive, and other switch circuits are rendered non-conductive(output high impedance state). Therefore, the output bit of AND gate 82is transmitted to sum output S through switch circuit 87 b.

When a logical sum operation OR is to be executed, logical suminstruction signal φor is activated, switch circuits 87 e and 87 f arerendered conductive, and other switches are set to output high impedancestate. Therefore, the output bit of OR gate 85 receiving the output bitsof XOR gate 81 and AND gate 82 is transmitted to sum output S. When theOR operation is executed, XOR gate 81 outputs “H” (“1”), when the bitvalues applied to inputs A and B have different logical values. AND gate82 outputs a signal of “1” when bits applied to inputs A and B are both“1”. Therefore, when at least one of the bits applied to inputs A and Bhas logical value “1”, a signal “1” is output from OR gate 85 throughswitch circuit 87 f to sum output S, and the result of OR operation isobtained.

As shown in FIG. 15, by selectively setting switch circuits 87 a to 87 gto the conductive state in accordance with the MIMD operation to beexecuted, the designated operation instruction can be executed using thelogic gates of internal elements of adder 50.
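The path switching of FIG. 15 can be checked with a gate-level sketch in Python. The variable names follow the reference numerals of the gates; the function name `adder50` and the `mode` argument are hypothetical stand-ins for the instruction signals φnot, φand, φxor and φor, and the carry output is reported only in the ordinary addition mode.

```python
def adder50(a, b, ci=0, mode="add"):
    """Return (S, Co) for one bit; Co is None in the logic modes."""
    inv80 = a ^ 1                 # inverter 80 on input A
    xor81 = a ^ b                 # XOR gate 81
    and82 = a & b                 # AND gate 82
    xor83 = xor81 ^ ci            # XOR gate 83
    and84 = xor81 & ci            # AND gate 84
    # First input of OR gate 85: XOR 81 via switch 87e when OR is selected,
    # otherwise AND 84 via switch 87d.
    or85 = (xor81 if mode == "or" else and84) | and82
    if mode == "add":
        return xor83, or85        # S via switch 87g, Co from OR gate 85
    if mode == "not":
        return inv80, None        # S via switch 87a
    if mode == "and":
        return and82, None        # S via switch 87b
    if mode == "xor":
        return xor81, None        # S via switch 87c
    if mode == "or":
        return or85, None         # S via switch 87f
    raise ValueError("unknown mode")
```

The OR path works because (A XOR B) OR (A AND B) equals A OR B for every input pair, so the existing carry OR gate can be reused without adding a dedicated OR gate.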

The configuration of adder 50 is merely an example, and a configurationsimilar to an FPGA (Field Programmable Gate Array), in which internalconnection paths are arranged in a matrix and interconnection is set inaccordance with the operation instruction signal, may be used.

Further, the configuration of full adder 50 shown in FIG. 15 is merelyan example and not limiting. Any full adder configuration can be used,provided that internal connection paths are set in accordance with theoperation instruction signal.

FIG. 16 schematically shows interconnection areas for data communicationbetween entries. Referring to FIG. 16, an interconnection area 90 fordata communication is provided between memory mat 30A and inter-ALUconnecting switch circuit 44. Interconnection area 90 for datacommunication includes an area 91 in which ±1 bit shift interconnectionlines are arranged, an area 92 in which ±4 bit shift interconnectionlines are arranged, an area 93 in which ±16 bit shift interconnectionlines are arranged, an area 94 in which ±64 bit shift interconnectionlines are arranged, and an area 95 in which ±256 bit shiftinterconnection lines are arranged.

A ±i bit shift interconnection line is for data communication betweenentries apart from each other by i bits. Here, interconnection lines for11 different types of data communications, including ±1, ±4, ±16, ±64and ±256 bit shifts and 0 bit shift, are prepared. As data communicationis performed in 2-bit unit (2-bits by 2-bits), interconnection lines fordata transfer using X register and XH register are arrangedcorresponding to each entry, in these interconnection areas 91 to 95.
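The eleven shift types can be enumerated directly, and four register bits such as E0-E3 are enough to distinguish them because 2^4 = 16 ≥ 11. The sketch below models per-entry movement with an individual shift amount for each entry. It is an illustration only: the names `ALLOWED_SHIFTS` and `recm_move` are hypothetical, the bit encoding of E0-E3 is not modeled, cyclic wrap-around between the lowest and highest entry numbers is assumed, and two entries targeting the same destination is treated as an error since that case is not covered here.

```python
import math

# 0 bit shift plus the ten signed distances: 11 types in total
ALLOWED_SHIFTS = (0, 1, -1, 4, -4, 16, -16, 64, -64, 256, -256)

# Number of select bits needed to distinguish the shift types
SELECT_BITS = math.ceil(math.log2(len(ALLOWED_SHIFTS)))


def recm_move(values, shifts):
    """Send values[j] to entry j + shifts[j] (cyclic); return the new list."""
    m = len(values)
    out = [None] * m
    for j, (val, s) in enumerate(zip(values, shifts)):
        if s not in ALLOWED_SHIFTS:
            raise ValueError("unsupported shift amount")
        dest = (j + s) % m                    # assumed cyclic wrap-around
        if out[dest] is not None:
            raise ValueError("two entries target the same destination")
        out[dest] = val
    return out
```

Unlike the SIMD movement instructions, where one distance applies to all entries, `shifts` here carries a separate amount and direction for every entry.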

FIG. 17 shows an exemplary arrangement of interconnection lines ininterconnection areas 91 and 92 shown in FIG. 16, Referring to FIG. 17,an interconnection arrangement is shown as an example having 1024entries and ALU processing elements 0 to 1023.

In FIG. 17, ±1 bit shift interconnection area 91 includes a +1 bit shiftinterconnection area 91 a and a −1 bit shift interconnection area 91 b.In +1 bit shift interconnection area 91 a, a line 100 a for transferringdata in one direction to an entry having a number larger by 1, and aline 100 b realizing 1-bit shift to an entry of the maximum number (ALU1023) are provided. Line 100 a performs shifting between neighboringentries (ALU processing elements), and therefore, lines 100 a arearranged in an aligned manner.

In the −1 bit shift interconnection area 91 b, similarly, a line 101 aconnecting neighboring entries and a line 101 b for data transfer fromthe entry of the minimum number (ALU element 0) to the entry of themaximum number (ALU 1023) are provided. Here again, lines 101 a arearranged in an aligned manner.

Therefore, in these interconnection areas 91 a and 91 b, per 1 bit oftransfer data, two interconnection lines are arranged. Therefore, wheninterconnections are made for 2-bit data transfer, the lines 100 a, 110b, 101 a and 101 b are arranged such that each of these perform 2-bitdata transfer in parallel.

The ±4 bit shift area 92 includes a +4 bit shift interconnection area 92 a and a −4 bit shift interconnection area 92 b. FIG. 17 shows the arrangement of +4 bit shift area 92 a, and −4 bit shift area 92 b is indicated by a block in dotted line.

The +4 bit shift area 92 a includes interconnection lines 102 a arranged being shifted in position from each other by one entry. There are four interconnection lines 102 a arranged in parallel, and each performs data transfer to an entry apart or spaced by 4 bits. In this case also, in order to perform +4 bit shift to an entry of large number, an interconnection line 102 b is provided. In FIG. 17, numbers on lines 102 a and 102 b represent entry numbers. Here, four +4 bit shift interconnection lines 102 a are arranged in parallel, and four interconnection lines 102 b achieving shift in the direction from the maximum number to the minimum number are arranged in parallel. Therefore, in interconnection area 92 a, 8 lines are arranged per 1 bit of transfer data.

As shown in FIG. 17, as the interconnection lines are arranged in a rhombic quadrilateral, lines for shifting can be arranged efficiently while avoiding tangling, and the layout area for the interconnection lines can be reduced.

Here, by arranging interconnection lines 100 b, 101 b and 102 b for entry return so as to overlap with interconnection lines 100 a, 101 a and 102 a for shifting, the interconnection layout area can further be reduced (a multi-layered interconnection structure is utilized).

FIG. 18 schematically shows an exemplary arrangement of interconnection lines in ±16 bit shift interconnection area 93 shown in FIG. 16. Here, the ±16 bit shift interconnection area 93 includes +16 bit shift interconnection areas 93 aa and 93 ab, and −16 bit shift interconnection areas 93 ba and 93 bb. In the +16 bit shift interconnection area 93 aa, interconnection line 103 a connects to an entry apart by 16 bits. For cyclic shift operation between entries, an entry return line 103 b is provided. Here, in −16 bit shift interconnection area 93 ba, an interconnection line 104 a is provided for connecting entries apart or away by 16 bits. Interconnection line 104 b is an entry return line, which similarly connects entries away by 16 bits in cyclic manner.

In ±16 bit shift interconnection area 93, by arranging interconnection lines for transferring 2-bit data shifted by 1 entry from each other, interconnection lines 103 a and 104 a can be arranged in parallel in the entry direction (vertical direction), and the interconnection layout area can be reduced. Here, in each of interconnection areas 93 aa, 93 ab, 93 ba and 93 bb, 16 lines are arranged.

FIG. 19 schematically shows an exemplary arrangement of interconnection lines in ±64 bit shift interconnection area 94 and ±256 bit shift interconnection area 95 shown in FIG. 16. Referring to FIG. 19, ±64 bit shift interconnection area 94 includes +64 bit shift interconnection areas 94 aa and 94 ab and −64 bit shift interconnection areas 94 ba and 94 bb. In each of these areas 94 aa, 94 ab, 94 ba and 94 bb, 64 interconnection lines are arranged in parallel (per 1 bit of transfer data). Here, the shift line connects entries away by 64 bits, in the + direction and − direction, respectively.

Similarly, ±256 bit shift interconnection area 95 is divided into interconnection areas 95 aa, 95 ab, 95 ba and 95 bb. Here, in each area, 256 interconnection lines are arranged in parallel per 1 bit of transfer data, and entries away or distant by 256 bits are connected.

Using such shift lines, interconnections for performing shifting operations of ±4 bits, ±16 bits, ±64 bits and ±256 bits are provided for each entry, whereby it becomes possible to set the amount of data movement (distance between entries and the direction of movement) for each entry in moving data. In the following description, “amount of data movement” refers to both distance and direction of movement.

FIG. 20 schematically shows a configuration of an inter-ALU communication circuit (RECM) 71 shown in FIG. 4. In FIG. 20, X register 52 and XH register 57 included in ALU processing element 34 are shown as representatives. X register 52 and XH register 57 are connected to internal data lines 65 a and 66 a, respectively, at the time of an MIMD operation and MIMD data transfer, as shown in FIG. 10.

Inter-ALU communication circuit 71 includes a transmission buffer 120 receiving values stored in X register 52 and XH register 57, a multiplexer 122 for setting a transfer path of a data bit from transmission buffer 120 in accordance with bits E0 to E3 stored in the movement data register, and a reception buffer 124 receiving transmission data through a signal line 116 commonly coupled to a group of interconnection lines for the ALU processing element and generating data after transfer.

Multiplexer 122 selectively drives one of signal lines 110 au to 110 ed provided corresponding to the entry. Signal lines 110 au to 110 ed each are a 2-bit signal line, representing the ±1 bit shift interconnection line to ±256 bit shift interconnection line shown in FIGS. 17 to 19. As shown in FIG. 20, shift interconnection lines are provided for each entry, and connection destination of such shift interconnection lines 110 au to 110 ed is set in a unique manner. By way of example, +1 bit shift interconnection line 110 au is coupled to a reception buffer of a neighboring ALU processing element of an entry having a number larger by 1, and −1 bit shift interconnection line 110 ad is coupled to a reception buffer of a neighboring entry having a number smaller by 1.

Reception buffer 124 commonly receives the corresponding group of signal lines (±1 bit shift lines to ±256 bit shift lines). The signal lines of the group of signal lines 115 are subjected to wired OR connection.

FIG. 21 schematically shows connection of a signal line 116 to the reception buffer. The group of signal lines 115 is connected in one-to-one correspondence between entries, with data transfer directions taken into account, as shown in FIGS. 17 to 19 above. Specifically, the group of signal lines 115 includes ±1 bit shift signal lines, ±4 bit shift signal lines, ±16 bit shift signal lines, ±64 bit shift signal lines and ±256 bit shift signal lines. These are wired-OR connected commonly to signal line 116.

Upon data transfer, in inter-ALU communication circuit 71, multiplexer 122 selects a data transfer signal line (bit shift line) in accordance with values E0 to E3 stored in the movement data register, and couples the selected shift signal line to transmission buffer 120. Therefore, for one ALU processing element, one shift signal line is selected. The shift signal line is a one-directional signal line, and in the entry (ALU processing element 34) of the transfer destination, by signal line 116 coupled to reception buffer 124, one signal line of the group of signal lines 115 is driven. Therefore, even when the group of shift signal lines is wired-OR connected, data can reliably be transferred and received by the entry of the transfer destination and the transfer data can be generated.

Here, if the load on the signal line 116 is considered too heavy and high speed data transfer through transmission buffer 120 may be difficult, a multiplexer for reception similar to multiplexer 122 is provided in the reception buffer. Here, the multiplexer for reception selects the source of data transfer based on the information at the time of data transfer. By setting the same data as the movement data E0 to E3 of the data transfer source, as the reception buffer selection control data at the destination of data transfer, it becomes possible for the reception buffer 124 to select the shift signal line on which the transfer data is transmitted.

FIG. 22 shows an exemplary description of an instruction for moving data between entries. FIG. 22 represents a programmable zigzag copy (2-bit mode) in which data are moved 2 bits by 2 bits. The 2-bit mode copy code is represented by “mx2_cp_zp”. A prototype of the 2-bit mode copy is represented by “void mx2_cp_zp (int ap, int bp, int cp, int n)”. Here, the argument, ap, is a destination address, and argument, bp, is a source address. The argument, cp, is an address for storing distance of movement between entries, and the argument, n, represents bit length of the transfer data storage region.

In the 2-bit copy code, the data of distance of movement between entries designated by pointer cp is copied in 2-bit units, in an RECM register (movement data register). Contents of n bits from the initial or start address bs designated by pointer bp are transferred in 2-bit units to the entry designated by the data in RECM register. At the entry as the transfer destination, the transferred data are copied in 2-bit units, in a region starting from the initial address as indicated by pointer ap.

FIG. 23 shows description of an instruction for programmable zigzag copy (1-bit mode) in which data are moved bit by bit. The 1-bit mode copy code is represented by “mx_cp_zp”. A prototype of the 1-bit mode copy is represented by “void mx_cp_zp (int ap, int bp, int cp, int n)”. The arguments ap, bp and cp of the 1-bit mode copy code are defined in the same way as the arguments of the 2-bit mode copy code. When the instruction of 1-bit mode copy code is executed, an operation similar to that for executing the 2-bit mode zigzag copy instruction is performed, except that the copy operation is executed bit by bit.

FIG. 24 schematically shows data movement when the programmable zigzag copy instruction shown in FIGS. 22 and 23 is executed. FIG. 24 shows, as an example, data transfer from data entry DERYa to data entry DERYb. As shown in FIG. 24, in the zigzag copy mode, in accordance with the data E0 to E3 of the amount of movement stored in the region RGc of 4 bit width from the start address cs designated by pointer cp in data entry DERYa, the data entry DERYb as the transfer destination is set (manner of connection or routing of multiplexer 122 of FIG. 20 is set).

Thereafter, the data of region RGb of n-bit width starting from the start address bs designated by pointer bp are transferred to the region RGa of n-bit width starting from start address as designated by pointer ap of data entry DERYb, in 1-bit units (when 1-bit mode programmable zigzag copy instruction is executed) or in 2-bit units (when 2-bit mode programmable zigzag copy instruction is executed). Data transfer paths are provided in one-to-one correspondence between entries, and data can be transferred without causing collision of data, by designating the data transfer destination individually for each entry.

Data transmission is performed using the X register and the XH register, and data reception is performed using the reception buffer. Here, after once storing the received data in X/XH register, the transfer data may be stored at bit positions designated by the address pointer ap, in accordance with a “store” instruction. Alternatively, in the zigzag copy operation, data may be directly written from the reception buffer to bit positions designated by address pointer ap, through an internal signal line.

Transmission and reception are not performed simultaneously. By way of example, transmission and reception may be done in the former half and in the latter half of one machine cycle, respectively. Alternatively, transmission and reception may be performed in different machine cycles. Thus, transmission and reception can be performed in one entry.

Selective activation for transmission and reception may be set, for example, by the mask bit V. When execution of a “load” instruction is masked at the time of transmission and execution of a “store” instruction is masked at the time of reception by the mask bit V, transmission and reception can be executed selectively. Alternatively, by driving a bit line pair of the corresponding data entry using the reception buffer, it becomes possible to execute writing of received data in all entries in parallel (the address pointer at the time of writing is the same for all the entries, as the word line is common to all entries).

FIG. 25 shows, in a list, the amount of data movement E0-E3, communication distance and communication direction, stored in movement data register (RECM register) 70 shown in FIG. 10. By the 4-bit movement data E0-E3, the direction of communication can be set to an up (+) direction (in which entry number increases) and a down (−) direction (in which entry number decreases), and the data communication distance can be set to any of 1, 4, 16, 64 and 256. Including the communication distance of 0, a total of 11 different types of data communication between entries can be realized.
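
As a rough illustration of these eleven movement types, the following sketch enumerates them and computes the destination entry of a cyclic shift. The bit encoding of E0-E3 is given only in FIG. 25, so the sketch works with signed distances instead; the names `NUM_ENTRIES`, `MOVES` and `destination`, and the use of modulo arithmetic for the entry return lines, are assumptions made for illustration.

```python
# Illustrative sketch only: signed distances stand in for the 4-bit
# movement data E0-E3, whose actual encoding is listed in FIG. 25.
NUM_ENTRIES = 1024  # entry count assumed from the FIG. 17 example

# Distance 0 plus the up (+) and down (-) shifts of 1, 4, 16, 64 and 256.
MOVES = [0] + [sign * d for d in (1, 4, 16, 64, 256) for sign in (+1, -1)]


def destination(entry: int, move: int) -> int:
    """Return the entry reached from `entry` by a cyclic shift of `move`."""
    if move not in MOVES:
        raise ValueError(f"unsupported movement amount: {move}")
    # The modulo models the entry return lines (assumed fully cyclic).
    return (entry + move) % NUM_ENTRIES


print(len(MOVES))
print(destination(1023, +1))
print(destination(0, -4))
```

Any distance outside this set is rejected by the sketch, mirroring the fact that only the listed shift lines exist for each entry.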

FIG. 26 shows an example of data movement between entries. In FIG. 26, entries ERY0 to ERY8 are shown as representatives. In inter-ALU connecting switch circuit 44, the data transfer path is set in accordance with the movement amount data E0 to E3. For entries ERY0, ERY2, ERY3 and ERY7, a +1 bit shift operation is set. For entry ERY1, a +4 bit shift operation is designated. For entry ERY4, a −4 bit shift operation is designated, and for entry ERY6, a −4 bit shift operation is designated. Further, for entry ERY8, a −1 bit shift operation is designated.

In FIG. 26, arrows represent data movement, and the root of each arrow indicated by a black circle is coupled through a multiplexer to the transmission buffer, and the tip end of the arrow is coupled to the reception buffer of the transmission destination.

The interconnection lines between entries arranged for inter-ALU connecting switch circuit 44 are one-directional lines, and hence, among entries ERY0 to ERY8, data movement can be executed in parallel without causing collision of data.
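
The movement amounts listed for FIG. 26 can be checked with a small sketch that confirms every destination entry receives data from only one source. The table below is taken from the description above (entry ERY5 is not specified there and is omitted); the names `FIG26_MOVES`, `destinations` and `is_collision_free` are hypothetical.

```python
# Movement amount per source entry, as described for FIG. 26.
# ERY5 is not specified in the text and is left out of this sketch.
FIG26_MOVES = {0: +1, 1: +4, 2: +1, 3: +1, 4: -4, 6: -4, 7: +1, 8: -1}


def destinations(moves: dict) -> dict:
    """Map each source entry number to its destination entry number."""
    return {src: src + amount for src, amount in moves.items()}


def is_collision_free(moves: dict) -> bool:
    """True when no two sources deliver data to the same destination."""
    dests = list(destinations(moves).values())
    return len(dests) == len(set(dests))


print(destinations(FIG26_MOVES))
print(is_collision_free(FIG26_MOVES))
```

The check covers only the receiving side of this one example; it does not model the electrical behavior of the one-directional lines themselves.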

Now, an operation when the programmable zigzag copy instruction shown in FIGS. 22 and 23 is executed will be described.

Step 1: When data movement is to be performed individually in each entry by zigzag copying, first, data representing the amount of data movement of the corresponding entry is set in advance in a region designated by the pointer cp of the data entry. At this time, mask bit V is set in a different region.

Step 2: Controller (21) executes the zigzag copy instruction, and under the control of the controller, the entry movement amount data E0 to E3 stored in the region designated by the pointer cp of data entry are stored in the movement data register (RECM register). Therefore, this operation is performed commonly in every entry.

Step 3: In accordance with the movement data E0 to E3 stored in the movement data register (RECM register), the connection path of the multiplexer (element 122 of FIG. 20) is set.

Step 4: In accordance with the data of the operation target (data to be moved) and dependent on whether it is a 1 bit mode copy or 2 bit mode copy, the transmission data is set in the register (X register and XH register, or X register) in the ALU processing element. At this time, the data in the region having the bit width of n bits designated by the pointer bp of the data entry are stored in the register of the corresponding ALU processing element. This operation is also executed commonly on all entries under the control of controller (21).

Step 5: The data set in the register for transfer (X and XH registers, or X register) are transferred to the entry at the destination of movement through multiplexer 122 shown in FIG. 20. At the entry at the destination of movement, the data that have been transferred through the reception buffer are stored bit by bit or 2 bits by 2 bits in the region designated by the pointer ap of the corresponding data entry (this operation is also executed with the pointer generated commonly to all entries by the controller 21).

Step 6: The process of Step 3 to Step 5 is executed repeatedly until all the data bits to be moved are transferred.

At the time of this data transfer, when bit “0” is set in the mask register (V register), data setting and transmission from the data entry of an entry to a corresponding data register (X, XH and the movement data registers) are not performed.
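
The steps above can be sketched in software as follows. This is a minimal behavioral model under stated assumptions, not the circuit itself: each data entry is a Python list of bits, the movement amount is held as an already decoded signed distance (loading E0 to E3 from the region at pointer cp is not modeled), wrap-around is modeled with a modulo, and the paths are assumed to be collision free. The names `Entry` and `zigzag_copy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    bits: list     # data entry contents, one bit per address
    move: int = 0  # signed distance standing in for movement data E0-E3
    v: int = 1     # mask bit V; 0 suppresses loading and transmission


def zigzag_copy(entries, ap, bp, n, unit=1):
    """Move n bits from address bp of each unmasked entry to address ap of
    the entry `move` positions away, `unit` bits at a time (1 or 2)."""
    count = len(entries)
    for offset in range(0, n, unit):
        width = min(unit, n - offset)
        in_flight = {}
        # Steps 3 and 4: set the path and load the transmit register(s).
        for index, entry in enumerate(entries):
            if entry.v == 0:
                continue  # masked entries neither load nor transmit
            dest = (index + entry.move) % count
            in_flight[dest] = entry.bits[bp + offset : bp + offset + width]
        # Step 5: each destination stores what its reception buffer received.
        for dest, chunk in in_flight.items():
            start = ap + offset
            entries[dest].bits[start : start + width] = chunk
```

Calling `zigzag_copy(entries, ap, bp, n, unit=2)` corresponds loosely to the 2-bit mode, and `unit=1` to the 1-bit mode; the outer loop plays the role of Step 6.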

Next, processing of an MIMD operation will be described.

Step 1: First, in a region having the bit width of n bits designated by the pointer cp of a data entry, an instruction (M0, M1) for performing an MIMD operation is set.

Step 2: An appropriate MIMD instruction among the MIMD operation instructions set in the data entries is stored in the MIMD instruction register, by executing a load instruction under the control of controller (21).

Step 3: A register load instruction is executed on the data of the operation target under the control of controller (21), data bits of bit positions designated by pointers ap and bp of data entry regions (RGa and RGb) are transferred to the corresponding ALU processing element, and one data bit (which is transferred first) is set in the X register. In the ALU processing element, an operation content is set by the MIMD instruction decoder such that the instruction set in the MIMD instruction register is executed. On the data loaded from address positions designated by pointers ap and bp of the data entry, the set operation is executed. The result of operation is stored at the bit position of the data entry designated by the pointer ap, by executing a store instruction under the control of controller (21).

Step 4: Until the number of times of operations reaches a designated number, that is, until the processing of all operations on the data bits of the operation target is complete, the process of Step 3 is repeatedly executed. Whether the operation number has reached the designated number or not is confirmed by checking whether the pointer ap or bp has reached the set maximum value or not.

When an SIMD operation is to be executed, under the control of controller 21 shown in FIG. 2, connection path of inter-ALU connecting switch circuit 44 is set commonly for all entries, and the content of operation of the ALU processing element 34 is also set commonly to all entries. Here, pointer control of data entries DERY is executed in parallel by controller 21, and one same instruction is executed in parallel in all entries. Now, a specific operation of performing a 4-bit addition will be considered.

[Exemplary Application of a Combination Circuit]

FIG. 27 shows an exemplary configuration of a common 4-bit adder. As shown in FIG. 27, a 4-bit adder adding 4-bit data A0-A3 and B0-B3 is implemented by seven half adders (HA) 130 a-130 g and three OR gates 132 a-132 c. As the internal configuration of the half adder, one using an XOR gate and an AND gate, one using an AND gate, an OR gate and a NOT gate, or various other configurations may be used. Half adders 130 a-130 d respectively receive 2 bits at corresponding positions. Half adders (HA) 130 a-130 g are provided for generating outputs S3 to S1, and OR gates 132 a-132 c are used for generating carries c3 to c1. Half adders 130 e-130 g receive carry outputs of the half adders of the preceding stage (1-bit lower half adders) and sum outputs of the half adders of the corresponding bit positions. OR gates 132 a-132 c receive carry outputs of the half adders of the corresponding bit positions.
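
A software sketch of such an adder, built from seven half adders and three OR gates, is given below for illustration. The wiring is one common ripple-carry arrangement that matches the stated gate count and may differ in detail from FIG. 27; the half adder uses the XOR/AND variant, and the names `half_adder`, `add4` and `to_bits` are hypothetical.

```python
def half_adder(a: int, b: int) -> tuple:
    """Return (sum, carry) of two bits using an XOR gate and an AND gate."""
    return a ^ b, a & b


def to_bits(value: int) -> list:
    """Split a 4-bit value into a list of bits, least significant first."""
    return [(value >> i) & 1 for i in range(4)]


def add4(a_bits: list, b_bits: list) -> tuple:
    """Add two 4-bit values (LSB first); return (sum bits, carry out)."""
    # Bit 0 has no carry input, so a single half adder is enough.
    s, carry = half_adder(a_bits[0], b_bits[0])
    sums = [s]
    for i in range(1, 4):
        partial, carry_ab = half_adder(a_bits[i], b_bits[i])  # adds A[i] and B[i]
        s, carry_in = half_adder(partial, carry)              # adds the lower carry
        carry = carry_ab | carry_in                           # OR gate merges carries
        sums.append(s)
    return sums, carry


print(add4(to_bits(0b0011), to_bits(0b1101)))
```

The loop uses two half adders and one OR gate for each of bits 1 to 3, which together with the single half adder for bit 0 gives the seven half adders and three OR gates mentioned above.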

When the 4-bit adder shown in FIG. 27 is realized by a combination circuit including a 1-input, 1-output NOT gate, a 2-input, 1-output AND gate, a 2-input, 1-output OR gate and a 2-input, 1-output XOR gate, the logic circuit of 4-bit adder shown in FIG. 28 can be obtained.

As shown in FIG. 28, a 4-bit addition is executed, divided into eight stages STG. In this configuration, half-addition operations that can be performed in parallel are executed in parallel, followed by an operation of receiving the carry propagation later. The configuration of the 4-bit adder realized by the combination circuit of logic gates shown in FIG. 28 can be found by developing the 4-bit adder shown in FIG. 27 in logic gates, considering carry propagation.

In the 4-bit adder shown in FIG. 28, from 4-bit inputs AIN[3:0] and BIN[3:0], a 4-bit output DOUT[3:0] and a carry output C_OUT are generated.

The logic operation of logic circuit shown in FIG. 28 is executed with the parallel processing device described in the foregoing, stage by stage successively in accordance with the MIMD instruction. Referring to FIG. 28, at the time of MIMD operation, in each stage STG, one cell (logic gate) is allocated to one entry. Each time the stage STG changes, an output signal of the logic gate is propagated to a different entry, and the amount of movement of the logic gate output differs cell by cell. Further, the operation executed in the entry (cell) differs in each stage. Therefore, the amount of movement and the instruction are set individually for each entry, and mutually different MIMD operations are executed.

FIG. 29 shows state of data stored in data entries at the start of the stage, when an operation of stage STG4 is executed in the logic circuits shown in FIG. 28. As data entries, data entries DERY0 to DERY7 are used. At positions designated by address pointer ap of four data entries DERY0 to DERY3, respective bits of 4-bit data A are stored, and similarly at positions designated by address pointer ap of data entries DERY4 to DERY7, respective bits of 4-bit data B are stored. Consequently, when the MIMD instruction is executed, different from an execution of an SIMD type operation, data of the operation target are stored dispersed over a plurality of entries, and the result of operation is transmitted to entries of propagation destinations of respective logic gates and stored in temporary regions.

Temporary regions t1 to tmp store process data, and at addresses designated by temporary pointers t1, t2 and t3, output values of logic gates of respective stages are stored. At the region designated by temporary pointer tmp, the other operation data of each entry is stored. Specifically, in each entry, a binary operation is executed on the data bit stored at the bit position indicated by temporary pointer ti (i is other than mp) and the data bit stored at the bit position indicated by the temporary pointer tmp. When a negation operation involving an inversion is executed, the inverting operation is performed on the data bit stored at the bit position indicated by the temporary pointer ti (in the following, generally referred to as a temporary address ti where appropriate).

FIG. 29 shows a data flow when an operation A+B=(0011)+(1101) is performed.

MIMD instruction bits are stored in MIMD instruction register of the corresponding ALU processing unit in 2-bit mode, at data entries DERY0 to DERY7.

Before the start of operation stage STG4 (at the end of stage STG3), operations are performed in four data entries DERY0, DERY2, DERY5 and DERY7 (each respective mask bit V (content of V register) is set to “1”). Here, operation instruction bits M0 and M1 of data entries DERY0 and DERY2 indicate an AND operation, and MIMD operation instruction (bits M0, M1) of data entry DERY5 designates a NOT operation. MIMD instruction bits M0 and M1 of data entry DERY7 designate an OR operation. Data entries DERY0 and DERY2 have executed the operation of the AND gate in the preceding stage of OR gate G2 of stage STG4, and store the result of this operation at temporary address t3. Data entries DERY5 and DERY7 store the output of an inverter of the preceding stage of gate G1 and the output of an OR gate in the preceding stage of gate G3, respectively, at temporary address t3.

Specifically, in FIG. 29, at the start of stage STG4 (at the completion of stage STG3), output values of stage STG3 are established, logical values of data entries DERY0 and DERY2 are “1”, the result of negation at data entry DERY5 is “0”, and the result of OR operation “1” is stored in data entry DERY7. At the time of operation, in data entries DERY0 to DERY7, operations are selectively executed in accordance with the mask bit (contents of V register), and the result of operation is stored at temporary address t3 of the corresponding data entry. Therefore, in data entries DERY0 and DERY2, an AND operation of bits at temporary addresses t3 and tmp is performed, and bit “1” is stored at temporary address t3.

In data entry DERY5, a NOT operation is performed, the bit value “1” that has been stored previously is inverted, and bit “0” is stored at temporary address t3. In data entry DERY7, an OR operation of bit values stored at temporary pointers t3 and tmp is performed, and the result of operation is again stored at temporary address t3. Therefore, “1” is stored at temporary address t3 of data entry DERY7.

Then, in order to execute the operation of stage STG4 shown in FIG. 28, data are rearranged.

Here, data entry DERY1 is allocated as the operation region of OR gate G2, data entry DERY4 is allocated as the region of AND gate G1, and data entry DERY5 is allocated as the region of OR gate G5. The region of data entry DERY6 is allocated to inverter G3 performing a NOT operation. Data entry DERY7 is allocated to the AND gate G4 for performing an AND operation.

FIG. 30 schematically shows data movement when the operations of stage STG4 are performed. OR gate G2 must receive an output bit of the AND gate of the preceding stage. Here, the output value of the AND gate in the preceding stage of OR gate G2 is stored at the bit positions designated by the temporary pointer t3 of data entries DERY0 and DERY2, and these data bits are transferred to temporary address t4 of data entry DERY1. Here, the bit at temporary address t3 of data entry DERY0 is stored at temporary address tmp of data entry DERY1, and the bit at temporary address t3 of data entry DERY2 is stored at temporary address t4 of data entry DERY1.

To data entry DERY4, AND gate G1 is allocated. Here, the outputs of an inverter and the OR gate of the preceding stage are moved to data entry DERY4. Specifically, the bit at temporary address t1 of data entry DERY2 is moved to temporary address tmp of data entry DERY4, and the output of the inverter at temporary address t3 of data entry DERY5, which has been established in stage STG3, is moved to temporary address t4 of data entry DERY4.

Data entry DERY5 is allocated to OR gate G5. Here, outputs of AND gate and OR gate of the preceding stage of OR gate G5 must also be moved, and the data at the bit position indicated by temporary pointer t1 of data entry DERY2 and the data bit of data entry DERY1 indicated by temporary pointer t2 are moved to positions indicated by temporary pointers tmp and t4, respectively.

Data entry DERY6 is allocated to inverter G3. Here, it is necessary to move the output bit of the OR gate of the preceding stage to temporary address t4 of data entry DERY6. The result of operation of temporary address t3 of data entry DERY7 operated at the preceding stage STG3 is transferred to the position of temporary address t4 of data entry DERY6.

Data entry DERY7 is allocated to AND gate G4. The AND gate G4 receives most significant bits BIN[3] and AIN[3]. Therefore, the data at the bit position indicated by address pointer ap of data entry DERY7 is moved to the position indicated by temporary pointer tmp, and the data bit at the bit position indicated by address pointer ap stored in data entry DERY3 is moved to the position of temporary address t4 of data entry DERY7. Thus, inputs of respective gates G1 to G5 of stage STG4 are stored at bit positions indicated by temporary pointers t4 and tmp of respective data entries.

In this data moving operation, basic amounts of data movement are ±1, ±4, ±16, ±64 and ±256. Therefore, the data are transferred, as far as possible, to regions indicated by the basic amounts of data movement. At the time of this data transfer, the zigzag copy instruction described above is used. By way of example, first, data transfer to the region indicated by temporary pointer t4 takes place and, thereafter, by executing the zigzag copy instruction, data are moved to the temporary address tmp in the similar manner. The data movement may be done in reverse order. Specifically, at the time of data movement to temporary addresses t4 and tmp, data may be moved to temporary address tmp first.

In the data movement, the amount of data movement at data entries DERY2 and DERY3 is +2. Therefore, for the data movement between these two entries, +1 bit shift operation is executed twice.
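
How a movement amount that is not one of the basic amounts might be broken into a sequence of basic shifts can be illustrated with a short search. The breadth-first search below is not taken from the description; it is simply one way to find a shortest sequence, assuming cyclic entry numbering over 1024 entries. The names `BASIC_MOVES` and `decompose` are hypothetical.

```python
from collections import deque

# Basic amounts of data movement: up and down shifts of 1, 4, 16, 64, 256.
BASIC_MOVES = [sign * d for d in (1, 4, 16, 64, 256) for sign in (+1, -1)]


def decompose(distance: int, num_entries: int = 1024) -> list:
    """Return a shortest list of basic shifts whose cyclic sum is `distance`."""
    target = distance % num_entries
    parent = {0: None}  # position -> (previous position, shift used)
    queue = deque([0])
    while queue:
        pos = queue.popleft()
        if pos == target:
            break
        for move in BASIC_MOVES:
            nxt = (pos + move) % num_entries
            if nxt not in parent:
                parent[nxt] = (pos, move)
                queue.append(nxt)
    # Walk back from the target to the start to recover the shifts.
    moves = []
    pos = target
    while parent[pos] is not None:
        pos, move = parent[pos]
        moves.append(move)
    moves.reverse()
    return moves


print(decompose(2))
print(decompose(3))
```

For a distance of +2 the search returns two +1 shifts, matching the case described above; for other distances it may combine shifts in opposite directions, which the description neither requires nor excludes.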

In this data moving operation, in each data entry, data bits at the same bit positions are read (loaded) by a row decoder, not shown, and data are transferred and stored. Therefore, when data are transferred to temporary addresses t4 and tmp, data are moved while pointers ap and t1 to t4 are updated. Here, whether the movement is to be executed or not is determined by the mask bit V of the mask register (V register).

In this data transfer, first, the load instruction may be executed with the source address changed successively, to store transfer data bits in the corresponding X registers in respective entries and, thereafter, data transfer (1-bit mode zigzag copy instruction) may be executed while changing the destination address, with the destination being temporary addresses t4 and tmp. By way of example, when the copy instruction mx_cp_zp shown in FIG. 23 is executed, pointer bp is successively updated to store transfer data bits in corresponding X registers, and then, a transfer instruction is executed to activate the transmission buffer, whereby the data transfer from the X register to the destination entry is executed. As the destination address is successively updated to t4 and tmp, the mask bit V is set/cleared in accordance with the destination address, and data is moved correctly from each entry to temporary addresses t4 and tmp.

FIG. 31 shows an operation in the entry when MIMD instruction bits are stored. When an MIMD instruction is to be set for each entry, the MIMD operation instruction mx_mimd shown in FIG. 14 is executed, and as the MIMD instruction bits, instruction bits M0 and M1 for stage STG4 designated by pointer cp are copied to the MIMD register. At this time, the bit value of mask register (V register) is set to “1” in data entries DERY1 and DERY4 to DERY7 in which operations are to be carried out, and the mask bit is set to “0” in other entries. Thus, MIMD instruction bits stored in data entries DERY1 and DERY4 to DERY7 are stored in the MIMD instruction register, and the operations to be executed are designated.

Thereafter, as shown in FIG. 32, in accordance with bit values M0 and M1 set in the MIMD instruction register, MIMD operation instruction “alu. op. mimd” is executed on the bits at addresses t4 and tmp. In FIG. 32, an OR operation is done in data entry DERY1, an AND operation is done in data entry DERY4, an OR operation is done in data entry DERY5, a NOT operation is done in data entry DERY6, and an AND operation is done in data entry DERY7.

In these operations, the operation is executed on bit values stored at temporary addresses t4 and tmp, and the result of operation is stored at a bit position of temporary address t4. In the data entry where the operation is not executed, the corresponding mask bit V is “0”. After execution of operations at stage STG4, results of operations are stored at bit positions of temporary address t4 of data entries DERY1 and DERY4 to DERY7.
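
The per-entry execution of stage STG4 can be sketched as below. Operation names stand in for the instruction bits M0 and M1 (their encoding is not restated here), and several operand bits are arbitrary placeholders because the description does not give every value; `OPS`, `mimd_step`, `make_entry` and `stage4` are hypothetical names.

```python
# Illustrative sketch only: operation names stand in for the MIMD
# instruction bits M0 and M1.
OPS = {
    "AND": lambda t4, tmp: t4 & tmp,
    "OR": lambda t4, tmp: t4 | tmp,
    "NOT": lambda t4, tmp: 1 - t4,  # unary: the tmp operand is ignored
}


def mimd_step(entries: list) -> None:
    """Apply each unmasked entry's own operation and store the result at t4."""
    for entry in entries:
        if entry["V"] == 1:
            entry["t4"] = OPS[entry["op"]](entry["t4"], entry["tmp"])


def make_entry(op=None, t4=0, tmp=0) -> dict:
    """Build one entry; an entry without an operation gets mask bit V = 0."""
    return {"V": 0 if op is None else 1, "op": op, "t4": t4, "tmp": tmp}


# Stage STG4 allocation from the text; operand bits are placeholders.
stage4 = [
    make_entry(),                    # DERY0: masked
    make_entry("OR", t4=1, tmp=1),   # DERY1: OR gate G2
    make_entry(),                    # DERY2: masked
    make_entry(),                    # DERY3: masked
    make_entry("AND", t4=0, tmp=1),  # DERY4: AND gate G1
    make_entry("OR", t4=0, tmp=0),   # DERY5: OR gate G5
    make_entry("NOT", t4=1),         # DERY6: inverter G3
    make_entry("AND", t4=0, tmp=1),  # DERY7: AND gate G4
]
mimd_step(stage4)
print([entry["t4"] for entry in stage4])
```

A single call performs a different operation in each unmasked entry, which is the point of the MIMD mode; masked entries keep their t4 bit unchanged.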

Thereafter, through similar processing, operations of stages STG5 to STG8 are executed.

As the MIMD instruction control bit, the MIMD instruction control bit or bits necessary for each stage is or are stored. Therefore, the bit width of the region for storing the MIMD operation instruction control bits is set in accordance with the number of operation stages, and the bit width of the region designated by the temporary pointer is also set in accordance with the number of operation stages.

[Exemplary Application of Sequential Circuit]

FIG. 33 shows a general configuration of a 2-bit counter as an example of a sequential circuit. The 2-bit counter 33 shown in FIG. 33 includes two stages of cascaded D flip-flops DFF0 and DFF1. D flip-flop DFF0 of the first stage receives at a clock input a clock signal CLK, and D flip-flop DFF1 of the next stage receives at a clock input a signal from an output /Q of the D flip-flop DFF0 of the first stage. D flip-flops DFF0 and DFF1 have their complementary outputs /Q coupled to their inputs D. From outputs Q of D flip-flops DFF0 and DFF1, count bits Q0 and Q1 are output, respectively.

In the 2-bit counter shown in FIG. 33, D flip-flops DFF0 and DFF1 outputthe state of a signal at the D input immediately before a rise of asignal applied to the clock input. Therefore, the state of the signal atthe output Q of each of D flip-flops DFF0 and DFF1 changes insynchronization with the rise of the signal applied to its clock input.The configuration of the 2-bit counter shown in FIG. 33 is also used asa frequency divider dividing the frequency of clock signal CLK.

FIG. 34 shows a configuration in which the 2-bit counter of FIG. 33 isimplemented by XOR gates and an AND gate. Referring to FIG. 34, the2-bit counter includes flip-flops FF0 and FF1, an XOR gate G10 receivingthe signal at output Q of flip-flop FF0 and an input signal IN, an ANDgate G11 receiving the input signal IN and a signal from output Q offlip-flop FF0, and an XOR gate G12 receiving an output signal of ANDgate G11 and a signal from output Q of flip-flop FF1. The output signalof XOR gate G10 is applied to an input D of flip-flop FF0, and theoutput signal of XOR gate G12 is applied to an input D of flip-flop FF1.

To the clock inputs of flip-flops FF0 and FF1, a clock signal CLK isapplied commonly.
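As a minimal sketch of the gate connections of FIG. 34, the next state of the two flip-flops at one rise of clock signal CLK can be computed as follows. The function name `counter_step` and the tuple representation of the state are hypothetical and are used only for illustration.

```python
def counter_step(ff0, ff1, inp=1):
    """Return the (FF0, FF1) values latched at the next rise of CLK."""
    g10 = ff0 ^ inp   # XOR gate G10: input D of flip-flop FF0
    g11 = inp & ff0   # AND gate G11
    g12 = g11 ^ ff1   # XOR gate G12: input D of flip-flop FF1
    return g10, g12


def run(cycles, state=(0, 0), inp=1):
    """Collect the successive states over the given number of clock cycles."""
    states = []
    for _ in range(cycles):
        state = counter_step(*state, inp=inp)
        states.append(state)
    return states


if __name__ == "__main__":
    # With IN = 1 the pair (FF1, FF0) counts 0, 1, 2, 3 and wraps around.
    for ff0, ff1 in run(4):
        print(ff1 * 2 + ff0)
```

With the input signal IN held at “1”, XOR gate G10 acts as an inverter for flip-flop FF0, and flip-flop FF1 toggles only when flip-flop FF0 holds “1”.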

In the 2-bit counter shown in FIG. 34, flip-flops FF0 and FF1 are realized by memory cell regions held in data entries. In the 2-bit counter shown in FIG. 34, as the number of stages of logic operations, three stages STG are used. Signals are taken and held in flip-flops FF0 and FF1 by storing output values of XOR gates G10 and G12 at corresponding bit positions in the corresponding data entry.

FIG. 35 shows an exemplary bit arrangement when the operation of the 2-bit counter shown in FIG. 34 is emulated. At data entries DERY0 to DERY7, the input signal IN is stored at the bit position designated by address pointer ap. The input signal IN has a bit value “1”. Bit values of temporary addresses t1-t3 correspond to output bits of stages STG1 to STG3, respectively. Temporary address tmp is not used in the 2-bit counter operation.

In order to store the values stored in flip-flops FF0 and FF1, at the data entry, pointer addresses FF0 and FF1 are prepared (here, both the flip-flops and the pointer addresses indicating the bit positions are denoted by the same reference characters).

In FIG. 35, eight data entries DERY0 to DERY7 are provided for the following reason. There are four initial states of flip-flops FF0 and FF1, and for the four initial states, one set of four data entries is used. This is to represent an operation of one stage by one set of data entries. In FIG. 35, states of stages STG2 and STG3 are represented by the set of data entries DERY4 to DERY7 and the set of data entries DERY0 to DERY3. In the 2-bit counter shown in FIG. 34, the count operation can be emulated using four data entries. For the MIMD instruction bits, a region of 6 bits is secured to successively execute the XOR operation, AND operation and XOR operation in correspondence to stages STG1 to STG3, respectively (the region for storing the mask bit and the like is not shown).

In data entries DERY0-DERY3, operation instruction (control) bits M0 and M1 are set to “1, 0”, and an XOR operation is designated. On the other hand, for the data stored in data entries DERY4 to DERY7, operation instruction (control) bits M0 and M1 are both set to “1”, and an AND operation is designated.

First, the process of operation in data entries DERY0 to DERY3 will be described. In the region indicated by temporary address t3, the initial value of flip-flop FF1 is stored. As to flip-flop FF0, the result of operation of stage STG1 differs dependent on the initial value, and the result of operation is stored in address pointer FF0. In FIG. 35, the bit value representing the result of operation at stage STG1 is stored at temporary address t1.

The bit value of temporary address t2 corresponds to the state before the rise of clock signal CLK, and it is a logical value before data storage to flip-flop FF0. Therefore, the bit value at the position of temporary address t2 and the value stored in flip-flop FF0 have opposite logical values.

At stage STG3, an XOR operation is performed on the bit value of temporary address t2 and the bit value stored in flip-flop FF1, and the result of operation is again stored in the bit position of flip-flop FF1.

Specifically, in data entries DERY0 to DERY3, as initial values of flip-flops FF0 and FF1, (0, 0), (0, 1), (1, 0) and (1, 1) are stored at pointer addresses FF0 and FF1. Before the rise of the clock signal CLK, in accordance with the values stored in flip-flop FF0, the output value of XOR gate G10 is determined, the bit value of temporary address t1 is determined, and as the clock signal CLK rises, the value stored in flip-flop FF0 is determined by the output bit value of XOR gate G10.

At stage STG2, in accordance with the value stored in flip-flop FF0 before the rise of clock signal CLK, the output bit value of AND gate G11 is determined, and the bit value is stored at temporary address t2. Therefore, bit values of temporary addresses t2 and t1 have opposite logical values.

At stage STG3, in accordance with the output value of AND gate G11 and the value stored in flip-flop FF1, the output value of XOR gate G12 is determined. The output value of XOR gate G12 is stored in flip-flop FF1 in synchronization with the rise of clock signal CLK. FIG. 35 shows the state when the XOR operation has been done, before the rise of clock signal CLK, at stage STG3. Specifically, at temporary address t3, the value stored in flip-flop FF1 is set as the input bit value, the XOR operation is done on the bit values of temporary addresses t2 and t3, and the result of operation is stored at pointer address FF1, as the value stored in flip-flop FF1, at the completion of operations of stage STG3. At the time of this operation, the result of the XOR operation is written (stored) in temporary address t3, and thereafter, the contents of temporary address t3 are written to the position of pointer address FF1. Thus, it follows that, in the subsequent processing, the value stored in flip-flop FF1 can always be set as the input bit to XOR gate G12, at the start of execution of stage STG3.

In the bit arrangements of data entries DERY4 to DERY7, the operation of stage STG2 is about to be executed. At stage STG2, MIMD instruction bits (control bits) M0 and M1 are both set to “1”, and an AND operation is executed.

Here, at stage STG1, in accordance with the value stored in flip-flop FF0, the logical value of its output bit (output bit of the XOR gate) is determined. XOR gate G10 operates as an inverter, and at temporary address t1, an inverted value of the value of flip-flop FF0 is stored.

When the operation of stage STG2 is executed, data has not yet been written to flip-flop FF0, and pointer addresses FF0 and FF1 of data entries DERY4 to DERY7 are shown maintaining the initial values of the 2-bit counter. Therefore, the bit value of temporary address t2 of stage STG2 is equal to the logical value of the bit stored in flip-flop FF0 (the input signal IN has the logical value “1”).

At stage STG2, a logical product operation (AND operation) on the value stored in flip-flop FF0 and the bit at the bit position of address pointer ap is executed in each entry.

As shown in FIG. 35, by preparing logic operations as the MIMD type operation instructions, each entry executes an operation individually, and it becomes possible to emulate a sequential circuit.

Further, by repeatedly executing the operations, in data entries DERY0 to DERY7, the states of flip-flops FF0 and FF1 can be stored at pointer addresses FF0 and FF1, and thus, the state of flip-flops can be represented.
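The staged emulation described above can be sketched as follows, with one data entry per initial state of the flip-flops. This is an illustrative model only: the dictionary keys mirror the pointer addresses FF0, FF1, ap and t1 to t3 of FIG. 35, while the function name `emulate_clock` and the sequential Python loops (standing in for operations that the entries execute in parallel) are assumptions.

```python
def emulate_clock(entries):
    """Emulate one rise of CLK for the 2-bit counter of FIG. 34 in every entry."""
    for e in entries:                 # stage STG1: XOR gate G10
        e["t1"] = e["FF0"] ^ e["ap"]
    for e in entries:                 # stage STG2: AND gate G11
        e["t2"] = e["FF0"] & e["ap"]
    for e in entries:                 # stage STG3: XOR gate G12
        e["t3"] = e["FF1"]            # value of FF1 set as the input bit
        e["t3"] = e["t2"] ^ e["t3"]
    for e in entries:                 # rise of CLK: latch the stage results
        e["FF0"], e["FF1"] = e["t1"], e["t3"]
    return entries


def make_entries():
    """Four entries holding initial (FF0, FF1) values with input signal IN = 1."""
    initial = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return [{"FF0": a, "FF1": b, "ap": 1, "t1": 0, "t2": 0, "t3": 0}
            for a, b in initial]


if __name__ == "__main__":
    entries = emulate_clock(make_entries())
    print([(e["FF0"], e["FF1"]) for e in entries])
```

Because the stage results are held at temporary addresses until the last step, the value of flip-flop FF0 read at stage STG2 is still the value before the rise of the clock signal, as in the description of data entries DERY4 to DERY7.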

As described above, by adding the MIMD instruction register and a decoder in the ALU processing element, it becomes possible to have a parallel processing device of an SIMD type architecture operate as an MIMD type processing device. Consequently, it becomes possible to execute different instructions in different entries at one time, and the process time can be reduced.

Further, by the MIMD instruction register and decoder, it becomes possible to achieve emulation of a logic circuit on the parallel processing device. Specifically, a NOT element (1-input, 1-output), an AND element (2-input, 1-output), an OR element (2-input, 1-output) and an XOR element (2-input, 1-output) constitute a complete logic system, and therefore, every combinational circuit can be represented. Further, by preparing a region for holding data in the data entry, a sequential circuit such as a flip-flop or a latch can also be represented. Thus, every hardware circuit can be implemented by the parallel processing device in accordance with the present invention. Thus, in the parallel processing device, a software executing portion in accordance with the SIMD instruction and a hardware executing portion utilizing logic circuits can be provided together, and as a result, a processing device of very high versatility can be realized.

Further, when the 2-bit counter shown in FIG. 33 or 34 is formed by hardware circuitry, flip-flop FF0 experiences a gate delay of one stage while flip-flop FF1 experiences a gate delay of two stages. Therefore, in order to adjust operation timing in synchronization with the clock signal, it is necessary to set timings in consideration of delays of two stages of gates G11 and G12. This means that the operation margin of the clock signal must be enlarged, which makes it difficult to increase the speed of the clock signal. In the parallel processing device, however, the operation process is done in each stage, and the cycle of each stage is defined by the clock signal (clock signal of the parallel processing device). The result of operation of each execution stage and the input can be read from memory cells at an arbitrary timing. Therefore, in the 2-bit counter, the critical path of the first stage flip-flop is the gate delay of one stage, and that of the second stage flip-flop is the gate delay of two stages. For each flip-flop, the critical path can be changed. Therefore, timing adjustment between flip-flops becomes unnecessary, and correct operation processing and high speed operation can be realized.

Further, as the MIMD operation is made possible, dependency on the degree of parallelism of processes can be reduced, so that the parallel processing device (MTX) comes to have wider applications. As a result, operations that were conventionally handled by the host CPU can be closed within the parallel processing device (MTX), and therefore, the time necessary for data transfer between the CPU and the parallel processing device (MTX) can be reduced. Thus, the process performance of the overall system can be improved.

Further, data processing can be set by the entry unit in a reconfigurable manner, so that complicated data transfer (vertical movement; data movement between entries) can be controlled more flexibly, and high speed data transfer becomes possible.

FIG. 36 shows, in a list, the number of cycles necessary for data movement in accordance with vertical movement instruction “vcopy” (same as the “move” instruction) used in the conventional semiconductor parallel processing device (MTX), and in accordance with the circuit for communication between entries (RECM: Reconfigurable Entry Communicator) in accordance with Embodiment 1 of the present invention. In the table of FIG. 36, a parallel processing device (MTX) approach simulator, version 0.03.01 was used, and a 2-bit ALU performing 2-bit unit operation was used as a model. For the RECM, the cycle calculation of the approach simulator version 0.03.01, formed into a library, was used.

In FIG. 36, cycle numbers necessary for the amounts of data movement of 1 bit, 2 bits, 4 bits, 8 bits, 16 bits, 32 bits, 64 bits, 128 bits, 256 bits, 512 bits and 1024 bits are listed. When the vertical movement instruction “vcopy” (=move) is used, data movement in one direction is set commonly for all data entries. Here, 16 bits of data are moved 2 bits at a time. For the transfer of data bits, 8 cycles are necessary. Further, cycles necessary for data movement including load and store of transfer data are set in advance by the approach simulator.

When the entry communication is done using the RECM, as shown in FIG. 36, it is necessary to store communication control data E0 to E3 in the communication control register (RECM register). Therefore, the number of cycles for the movement becomes longer by the number of cycles (3 cycles in FIG. 36) for this operation. When the amount of data movement is a basic amount of movement, the necessary cycle number is 26 cycles when the vertical movement instruction is executed, and when the movement of the same distance is repeated, the number of cycles necessary for data communication becomes longer by 8 cycles.

Therefore, when communication with entries at the same distance is to be done, as in the case of simultaneous movement using the vertical copy instruction “vcopy” with all entries being at the same distance, the operation would be slower if the data movement to individual entries is executed using the RECM. Further, for each entry, the control data for setting the amount of data movement must be stored in the data entry, and therefore, the region for storing the communication control data must be provided in the memory mat.

When entries communicate with entries of different distances, however, the communication control can be realized for each entry, and hence, the process can be completed in a smaller number of cycles. The reason for this is as follows. When the communication distance (data movement distance) differs from entry to entry, according to the conventional method, it is necessary to execute selective movement using the vertical movement instruction “vcopy” and the mask bit of the mask register (V register). Therefore, it is necessary to repeatedly execute data movement for each data movement amount, and hence, the process takes a long time. When communication is controlled using the RECM register, however, the communication distance of each entry can be selected in one communication, and hence, the data movement process can be completed in a shorter time. By way of example, when data movement such as shown in FIG. 26 is to be done, there are five different amounts of data movement. Therefore, it is necessary to execute the data transfer instruction five times if the vertical movement instruction “vcopy” or the moving instruction “move” is used. According to Embodiment 1 of the present invention, however, data movement can be completed by one data communication with the amount of movement set for each entry, so that the time of data transfer can be reduced.
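The difference in the number of transfer passes can be illustrated by the following minimal sketch, which counts passes only and does not model the cycle numbers of FIG. 36. The function names, the list representation of the entries and the cyclic wrap-around of entry indices are assumptions made for illustration; the per-entry amounts in the example are limited to the basic amounts 0, ±1 and ±4.

```python
def move_with_vcopy(data, shifts):
    """Masked common movement: one pass per distinct nonzero movement amount."""
    n = len(data)
    out = list(data)
    passes = 0
    for amount in sorted(set(shifts) - {0}):
        passes += 1
        for i, s in enumerate(shifts):
            if s == amount:  # mask bit V = 1 only for entries moving by this amount
                out[(i + amount) % n] = data[i]
    return out, passes


def move_with_recm(data, shifts):
    """Per-entry movement: every entry moves by its own amount in a single pass."""
    n = len(data)
    out = list(data)
    for i, s in enumerate(shifts):
        out[(i + s) % n] = data[i]
    return out, 1


if __name__ == "__main__":
    data = list("ABCDEFGH")
    shifts = [1, -1, 4, 0, 1, -1, -4, 0]  # hypothetical per-entry amounts
    print(move_with_vcopy(data, shifts))
    print(move_with_recm(data, shifts))
```

Both functions produce the same arrangement; the masked variant needs as many passes as there are distinct nonzero amounts, whereas the per-entry variant needs one.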

Further, as the interconnection lines for data movement, conventionally used interconnection lines for executing the movement instruction “vcopy” or “move” in the conventional parallel processing device can directly be applied. Therefore, the number of cycles required for data movement can be reduced without increasing the area for interconnection lines. In the following, specific data movement processes will be described.

[Gather Process]

The gather process refers to a data moving process in which data at every 8 entries are collected, and the collected entry data are arranged in order from the first of the entries. Generally, this process is performed in image processing to introduce fluctuating noise (analog-like noise) at a boundary region, thereby to make a smooth tone change at the boundary region. FIG. 37 shows a flow of data movement when this process is executed on 64 entries.
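The net effect of the gather process, apart from the individual movement steps, can be expressed by the following minimal sketch. The function name `gather` and the list representation of the entries are hypothetical; the sketch only states which entry contents end up in the leading entries and does not model the RECM movement sequence.

```python
def gather(entries, stride=8):
    """Collect the content of every stride-th entry, starting at entry stride - 1."""
    return entries[stride - 1::stride]


def example_entries():
    """64 entries in which entries 7, 15, ..., 63 hold data A to H."""
    entries = ["."] * 64
    for k, label in enumerate("ABCDEFGH"):
        entries[8 * k + 7] = label
    return entries


if __name__ == "__main__":
    # The returned list is what is to be stored, in order, starting from entry 0.
    print(gather(example_entries()))
```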

In FIG. 37, 2048 entries are prepared as entries in the memory cell mat. Contents A to H of entries ERY7, ERY15, ERY23, ERY31, ERY39, ERY47, ERY55 and ERY63 are taken out, and arranged in order, starting from entry 0. In the gather process shown in FIG. 37, the following process steps are executed.

Step 1: First, data (E0-E3) for controlling data movement are stored in the control data storage region of the data entry. Here, a data storage region in the entry is set commonly in each entry, in accordance with a pointer.

Step 2: From the data entry, the data E for controlling data movement is stored in the corresponding RECM register (movement data register), in accordance with the amount of data movement. As shown in FIG. 37, when the data Ea for controlling data movement is used a plurality of times successively, it is initially set once at the time of moving communication data, and then, data movement between entries is executed repeatedly.

Step 3: Then, in accordance with the value E (E0-E3) stored in the RECM register, the distance and direction of movement are set for each entry, and the data are moved.

Referring to FIG. 37, first, in accordance with the group of data Ea for controlling movement, data B of entry ERY15 is transferred to entry ERY2047. Entry ERY7 is shifted downward by 1 bit. Data C of entry ERY23 is transferred to entry ERY7 (moved by −16 bits). Remaining entry data D-H are each moved by −16 bits, and the entry position of each of the data D-H is shifted in the decreasing direction by 16 bits (down-shift operation).

Next, in each entry, the same moving instruction is executed. The data stored in entry ERY15 is transferred to entry ERY2047, and data of entry ERY7 is down-shifted by 1 bit. In entry ERY23, a −16 bit shift is performed, and in other entries ERY23, ERY31, ERY39 and ERY47, a similar shift operation is performed. At this time, data B of entry ERY2047 is transferred to entry ERY3 (4-bit up-shift).

Again in the next cycle, data movement is performed in accordance with the group of control data Ea, so that data B, D, A, C, E and G are stored in entries ERY2 to ERY7, and data H is stored in entry ERY15.

Step 4: Then, the pointer is moved and the next movement control data is stored in the corresponding RECM register. In accordance with the group of movement control data Eb, data are moved. Specifically, a 4-bit up-shift operation is performed on the data of entry ERY2047, a 4-bit down-shift operation is performed on data A of entry ERY4, a 1-bit down-shift operation is performed on remaining entries ERY2, ERY3, ERY5-ERY7, and a 4-bit down-shift operation is performed on entry ERY15.

Step 5: The pointer is updated and the next movement control data is stored in the corresponding RECM register. The transfer data is stored in the X/XH register, and data are moved in accordance with the control data Ec. This moving operation is executed repeatedly on the transfer data bits. By this process, in ERY3 and ERY4, 1-bit up-shift and down-shift are performed, and the data are stored and exchanged. The data of entry ERY11 is stored, by 4-bit shift, in entry ERY7.

Step 6: The last group of movement control data is stored in the corresponding RECM register, and the amount of movement is set for each entry. The transfer data is stored in the X/XH register, and data are moved in accordance with the control data group Ed. In this process, contents of entries ERY2 and ERY3, and contents of entries ERY4 and ERY5, are exchanged in accordance with the control data group Ed. Thus, data A-H come to be successively stored from entry ERY0 to ERY7.

Therefore, by individual moving operations, such a gather process can also be achieved, and high speed processing becomes possible.

In the data moving flow shown in FIG. 37, there are entries on which data movement is performed and not performed, in accordance with the movement control data Ea. Whether data movement is to be executed or not can be set by the mask bit, and the mask bit may be set/cleared for each moving operation. Further, at this time, once the mask bit is set and the moving operation is done on the entry, unnecessary transfer data can be rewritten by data transferred in a subsequent cycle, as shown by hatched blocks in FIG. 37. Thus, even when the mask bit is set to the same state for the same group of movement control data, there is no particular problem as long as the data can be rewritten by the transfer data of a later cycle.

FIG. 38 shows, in a list, the relation between the number of entries, the necessary number of cycles and the control bits when the gather process shown in FIG. 37 is performed on 16-bit data. The V flag represents a mask bit stored in the mask register (V register). The control bit represents the number of data (E0-E3) that determine the transfer destination/amount of movement when the zigzag copy mode is executed. Here, the region for storing initial data is not shown. The relation between the number of entries and the number of execution cycles shown in FIG. 38 is also obtained utilizing the cycle number calculation of simulator version 0.03.01 described above.

The example of 64 entries shown in FIG. 38 corresponds to the operation of FIG. 37. As for the control bits, there are four different moving instructions, and they are 16 bits in total, and in order to stop data transfer of regions (entries) unrelated to the data transfer, a mask flag is used. Therefore, a 4-bit mask flag is used for each case, and the number of control bits becomes larger by the number of mask bits, than the moving instruction used for each movement.

FIG. 39 shows the relation between the number of entries and the number of cycles in the gather process shown in FIG. 38. In FIG. 39, the abscissa represents the number of entries and the ordinate represents the number of cycles. As can be seen from FIG. 39, as the number of entries increases, the number of necessary cycles naturally increases. When data movement is performed individually for each entry using the RECM register, however, the number of cycles is smaller than in the configuration in which movement in the same direction is done in each entry in accordance with the vertical movement instruction “vcopy” or “move”.

FIG. 40 represents the number of entries and the bit width of the region occupied by control bits, in the gather process shown in FIG. 38. In FIG. 40, the abscissa represents the number of entries, and the ordinate represents the bit width of the control data storing region. Referring to FIG. 40, when data movement is performed by setting the amount of data movement in the entry unit using the RECM register, it is necessary to store the movement control data for each entry, and hence, the bit width of the region for storing the control bits significantly increases relative to the bit width of the region storing the mask bit (V flag) when the common vertical movement instruction “vcopy” is used. The transfer data, however, contains 16 bits, and the bit width of one entry is sufficiently wide (the width corresponds to the number of memory cells of one bit line pair, and corresponds to the number of word lines). Therefore, there is sufficient margin for providing the region for storing the control bits.

[De-Interleave Process]

De-interleave process refers to a process for moving data strings aligned in the vertical direction of entries such that data of even-numbered entries are stored in the upper half area of the entry group and data of odd-numbered entries are stored in the lower half area of the entry group.
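The resulting arrangement can be stated compactly by the following minimal sketch. The function name `deinterleave` is hypothetical, and the mapping of the “upper half area” to the lower-numbered positions of the list is an assumption; the sketch shows only the final arrangement, not the sequence of movements.

```python
def deinterleave(entries):
    """Place data of even-numbered entries first, then data of odd-numbered entries."""
    return entries[0::2] + entries[1::2]


if __name__ == "__main__":
    # Eight entries, as in the final state of the 8-entry example.
    print(deinterleave(list(range(8))))
```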

FIG. 41 shows data flow in the de-interleave process in the parallel processing device in accordance with Embodiment 1 of the present invention. Referring to FIG. 41, a state SA represents the initial state of data stored in the entry ERY, and a state SB represents the state at the end of processing 4 entries. A state SC represents a state at the completion of the de-interleave process, when there are 8 entries ERY.

As shown in FIG. 41, data are moved successively, using movement control data groups Ea-Ec, and contents of even-numbered entries and odd-numbered entries are exchanged for each of the entries, whereby the data of even-numbered entries and the data of odd-numbered entries can be classified.

FIG. 42 shows a process procedure when the moving operation is performed in accordance with the common copy instruction (vertical movement instruction) “vcopy” of the SIMD type movement instruction. Here, the data of respective entries must be moved in the same direction by the same amount, and therefore, the process is executed separately for data movement of even-numbered entries and for data movement of odd-numbered entries. As shown in FIG. 42, when the data are moved using the vertical copy instruction “vcopy”, the original data (initial data) must be divided into even-numbered entries and odd-numbered entries for classification, and hence, the data must be held in each of the entries ERY0 to ERY7. Therefore, the transfer data are held in a temporary region, and the data held in the temporary region are transferred successively. At the time of this transfer, it is again necessary to transfer data separately for data of even-numbered entries and for data of odd-numbered entries. Accordingly, the temporary region must have a region for transferring data of odd-columns and a region for transferring data of even-columns, and thus, the temporary region having double the bit width of transfer data must be provided.

FIG. 43 shows, in a list, the number of entries, the bit width of the data storage region and the number of transfer cycles when 16-bit data are processed in the de-interleave process shown in FIGS. 41 and 42. The number of cycles shown in FIG. 43 is also obtained using the RECM simulator version 0.03.01 described above, for the data transfer of 2-bit unit.

When the number of entries is 4, what occurs is simply a transition from initial state SA to state SB, as shown in FIG. 41. Therefore, only the movement control data Ea is used as the control bits. Accordingly, when the RECM is used, 4 bits of the instruction Ea are necessary as the control bits (the mask bit is not shown: a 0-bit shift may be executed as the data movement, and in that case, the mask bit is unnecessary).

In the RECM, a cycle for setting the movement control data to the RECM register becomes necessary, and at the time of data transfer in 2-bit units, cycles for storing the transfer data in the X/XH register, for data transfer, and for writing the transfer data and the like at the transfer destination are required. Even when the number of entries is 4, 33 cycles are necessary for data movement. When the number of entries is 4 and the vertical copy instruction “vcopy” is used, it is necessary to use 2 bits of mask flag to inhibit transfer when data of even-numbered rows are transferred and data of odd-numbered rows are transferred. Further, in each movement, data must be transferred by the same amount of data, and therefore, the number of cycles significantly increases to 172 cycles.

FIG. 44 is a graph representing the relation between the number of entries and the number of cycles, of the comparison table of the de-interleave process shown in FIG. 43. FIG. 45 shows the number of entries and the width of control bits, of the table shown in FIG. 43. In FIG. 44, the abscissa represents the number of entries and the ordinate represents the number of cycles. In FIG. 45, the abscissa represents the number of entries and the ordinate represents mask/control bit width.

As can be seen from FIG. 44, by moving data using the RECM register, high-speed data movement becomes possible. Further, as can be seen in FIG. 45, if the number of data movement entries is small, the region used in the memory mat can be reduced. Here, it is unnecessary to provide a temporary region for storing data of even-numbered rows of entries and data of odd-numbered rows of entries, and the moved data can directly be written to positions indicated by the original address pointers. Therefore, though the bit width of control bits increases when the RECM is used, the temporary region can be eliminated, and the bit width of the region used in the memory mat for data movement can be made comparable to or smaller than when the vertical movement instruction is executed.

[Anti-Aliasing Process]

An alias generally refers to imaginary data not included in the original data. The anti-aliasing process is to remove or avoid the alias component. In the field of image processing, the anti-aliasing process means removal of jaggies (jaggy or stepwise portions along the pixels of the figure) of generated figures. The anti-aliasing process includes an operation of calculating a mean value among pixels of the region of interest. The aliasing process is for exchanging, among data aligned in vertical directions over entries, data of a prescribed range.

FIG. 46 shows an example of exchange of data arrangement in the aliasing process. Referring to FIG. 46, data of entries ERY10 to ERY25 are sorted in the vertical direction, to be arranged in the order of original data stored in entries ERY25 to ERY10.

FIG. 47 schematically shows the data flow when the data are vertically moved at the time of data arrangement exchange in the alias process. Referring to FIG. 47, first, a process of moving data of upper or lower 8 entries among 16 entries is executed, and then, data transfer of the remaining 8 entries is executed. Here, by successively reducing the amount of data transfer from 8, 4, 2 to 1, sorting of data can be realized. In Embodiment 1, however, the basic amounts of data transfer are ±1, ±4 and ±16, and therefore, communication with a position away by 8 entries requires execution of a 4-bit shift operation twice. For a 2-bit shift operation, a 1-bit shift operation must continuously be repeated twice. At the time of this data transfer operation, as an example, a +shift operation represented by a solid line and a −shift operation represented by a chain-dotted line are executed alternately. When the regions for data holding and for data movement collide, correct data transfer becomes impossible. Therefore, a temporary region is necessary to hold the intermediate data. As the transfer takes place twice, two temporary data regions for +shift and −shift operations are necessary.
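The decomposition of a movement distance into the basic amounts ±1, ±4 and ±16 can be sketched as follows. The function name `decompose` and the greedy, single-direction strategy are assumptions made for illustration; the sketch reproduces the two cases stated above (8 as two 4-bit shifts, 2 as two 1-bit shifts) but is not claimed to give the smallest number of shifts when shifts of opposite directions may be combined.

```python
BASIC_AMOUNTS = (16, 4, 1)  # magnitudes of the basic movement amounts


def decompose(distance):
    """Split a signed distance into basic shifts, all in the same direction."""
    sign = 1 if distance >= 0 else -1
    remaining = abs(distance)
    steps = []
    for amount in BASIC_AMOUNTS:
        while remaining >= amount:
            steps.append(sign * amount)
            remaining -= amount
    return steps


if __name__ == "__main__":
    for d in (8, 4, 2, 1, -8):
        print(d, decompose(d))
```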

Further, data transfer is performed in entries ERY10 to ERY25, while data movement is not performed in other entries. Therefore, it is necessary to mask data transfer using the mask bit. Further, a mask is necessary also when the +shift and −shift are performed alternately.

FIG. 48 shows a result of simulation when the alias process of 32 bit data shown in FIG. 47 was executed. As the simulator, approach version 0.03.01 was used, and in the simulator, a cycle number calculation simulator provided as a library is utilized. When 32-bit data are to be moved using the vertical movement instruction “vcopy”, 613 cycles are necessary, a region of 8 bits is necessary as the mask bit pattern storing region, and bit width for two data, that is, 64 bits, is necessary for storing temporary bits. When the RECM register is used, the necessary number of cycles is 442, and 16 bits are used as control bits. Specifically, it is necessary to execute the movement instruction four times (as the amount of movement is limited to 1, 4 and 16).

Therefore, as can be seen from the table of FIG. 48, data communication using the RECM register requires a smaller number of cycles, and higher speed of operation can be realized.

Further, when the RECM register is used, the contents of from entriesERY0 to ERY25 can be moved entirely, and therefore, the temporary regionbecomes unnecessary. Thus, the width of memory mat region used for datacommunication can be reduced.

Provision of the RECM register using data movement communicationcircuitry attains the following effects. Specifically, when the verticalmovement instruction “vcopy” or “move” is used that instructssimultaneous movement, data can be moved only between entries of thesame distance at one time. Therefore, when data movement over differentdistance entry by entry is necessary, movement between entries must berepeated a number of times in accordance with the amount of datamovement. When the inter-ALU data communication circuit (RECM register)in accordance with Embodiment 1 of the present invention is used,however, the distance of data movement between entries can be set andthe data can be moved, entry by entry in a programmable manner.Consequently, high-speed data movement between entries becomes possible.Further, dependent on the amount of data movement, data can be movedover desired, different distances for each entry, simply by onceexecuting the data movement instruction.

Further, by simply switching the selection signal of a multiplexer for data movement between entries from the overall control (control by controller 21 (see FIG. 1)) of the parallel processing device (main processing circuitry) to a control signal of an RECM register (including a decode circuit), data transfer can be controlled on an entry-by-entry basis, and addition of new interconnection resources is unnecessary.

Embodiment 2

FIG. 49 schematically shows a configuration of an ALU processing element in accordance with Embodiment 2 of the present invention. The configuration of the ALU processing element shown in FIG. 49 differs from that of the ALU processing element in accordance with Embodiment 1 shown in FIG. 1 in the following point. Specifically, C register 53, F register 54, D register 59 and XL register 58 are used as registers for storing movement control data E0 to E3. In other words, in place of MIMD instruction register (RECM register) 70, operation registers provided in ALU processing element 34 are used. In data movement, an arithmetic operation or a logic operation is not executed, and therefore, XL register 58, D register 59, C register 53 and F register 54 are not used. In the data movement operation, these unused registers are utilized as registers for storing movement control data, and hence it becomes unnecessary to provide an MIMD instruction register (RECM register) for this purpose only, and thus, the area occupied by the inter-ALU connecting switch circuit can be reduced.

Except for this point, the configuration of the ALU processing element shown in FIG. 49 is the same as that of the ALU processing element shown in FIG. 10, and therefore, corresponding portions are denoted by the same reference characters and detailed description thereof will not be repeated. It is noted, however, that in FIG. 49, a multiplexer (MUX) 150 is shown as an example, for switching a path of movement control between an operation of SIMD type architecture and an operation of MIMD type architecture. Multiplexer 150 selects, in accordance with the mode control signal S/M switching between execution of an SIMD instruction and an MIMD instruction, one of the control signal from controller 21 and the control instruction bits E0-E3 from these registers. The mode control signal is generated by controller 21 in accordance with a result of determination as to whether the instruction is an MIMD type movement instruction or not when the movement instruction is to be executed (when the MIMD type movement instruction is to be executed, the mode control signal S/M is set to a state for selecting the movement control data from the registers).

By way of example, in FIG. 49, control bit E3 is stored in C register 53, control bit E1 is stored in F register 54, control bit E0 is stored in XL register 58, and control bit E2 is stored in D register 59. Other combinations of control bits may be stored in these registers 53, 54, 58 and 59.

As the instruction for transferring (loading) the movement amount data to these registers 53, 54, 59 and 58, the load instructions shown in the list of instructions described previously may be utilized, and thus, the movement amount data can be stored in these registers.

As described above, according to Embodiment 2 of the present invention, as the registers for storing movement data, registers not used at the time of data moving operation among the registers provided in the ALU processing unit are used. Thus, the area occupied by the circuitry for inter-ALU movement can be reduced. Further, when the movement data are stored, the movement control data can be stored using the register load instruction for an SIMD operation, and therefore, program description for data movement control is easy.

Embodiment 3

FIG. 50 schematically shows a configuration of the ALU processing element according to Embodiment 3 of the present invention. The ALU processing element shown in FIG. 50 differs from the ALU processing element according to Embodiment 1 in the following point. Specifically, as the registers for applying the instruction to MIMD instruction decoder 74, C register 53 and XL register 58 are used. Except for this point, the configuration of the ALU processing element shown in FIG. 50 is the same as that of the ALU processing element shown in FIG. 10 and therefore, corresponding portions are denoted by the same reference characters and detailed description thereof will not be repeated.

It is noted, however, that in the configuration shown in FIG. 50 also, multiplexer 150 is provided for inter-ALU communication circuit (RECM) 71, for setting the connection path of inter-ALU communication circuit 71 by switching between the bits E0-E3 from data register (RECM register) 70 and the control signal from controller 21, in accordance with the mode control signal S/M.

In the configuration of ALU processing element 34 shown in FIG. 50, a register used only for storing the MIMD instruction becomes unnecessary, and hence, the layout area of the ALU processing element can be reduced. When registers 53 and 58 are used as the MIMD instruction registers as shown in FIG. 50, the instructions for executing the MIMD operation are described as follows.

MTX_MIMD(as, bs, cs, bit_count)
0: ptr.set#cs, p1;
1: mem.ldC@p1++;
2: mem.ldXL@p1;
3: ptr.set#as, p2; ptr.set#bs, p3;
4: for(i = 0; i < bit_count; i++) {
5: mem.ldX@p2++;
6: alu.op.mimd@p3++;
7: }

In the operation description, by the instruction of line number 0, the pointer of pointer register p1 is set to the initial value cs.

By the instruction of line number 1, the bit of the position designated by the pointer of pointer register p1 is loaded to the C register, and the count value of pointer register p1 is incremented by 1.

By the instruction of line number 2, the data bit at the bit position designated by pointer register p1 is loaded to the XL register.

In accordance with the instruction of line number 3, the pointer of pointer register p2 is set to the initial value as of address pointer ap, and the pointer of pointer register p3 is set to the initial value bs of address pointer bp.

By the “for” statement of line number 4, the range of variation of “i” is set within the range of 0 to bit width bit_count, and at each iteration, the value i is incremented.

By the instruction of line number 5, the bit at the position designated by the pointer of pointer register p2 is loaded to the X register, and then, the pointer of pointer register p2 is incremented.

By the instruction of line number 6, on the data bit at the position designated by the pointer of pointer register p3 and the data of the X register, the designated MIMD operation instruction “alu.op.mimd” is executed in accordance with the bits stored in C register 53 and XL register 58, and the result of execution is again stored at the bit position designated by the pointer of pointer register p3.

Line number 7 indicates the end of the instruction sequence.

Therefore, when the MIMD operation is executed, C register 53 and XL register 58 store the operation instructions (control bits) M0 and M1, the data bit as the operation target is transferred, and an operation with the bit at the position designated by the pointer of pointer register p3 is executed. Here, when the operation is executed on a 1-bit basis, a logic operation is executed using the X and XH registers: the data at the bit position designated by the pointer of pointer register p3 is transferred to the XH register, and the operation is executed. When a negation instruction NOT is to be executed, an inverting operation is executed on the bit value of a predetermined register, among the bits stored in the XL and XH registers.

In this manner, by utilizing an instruction of a common SIMD type architecture, it is possible to set an MIMD instruction in the registers of each ALU processing element and to execute the operation process.
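The instruction sequence MTX_MIMD can be traced with a minimal Python sketch that models one entry as a list of bits. This is an illustration under stated assumptions, not the device's implementation: the function names and the memory layout of the example are hypothetical, the mapping of (M0, M1) to NOT/OR/XOR/AND is taken from the decoder of Embodiment 5 with XL register 58 holding M0 and C register 53 holding M1, and NOT is assumed to invert the bit loaded to the X register.

```python
def decode(m0, m1):
    """Map instruction bits (M0, M1) to a 1-bit operation on (x, y)."""
    if (m0, m1) == (0, 0):
        return lambda x, y: 1 - x   # NOT; the inverted operand is an assumption
    if (m0, m1) == (0, 1):
        return lambda x, y: x | y   # OR
    if (m0, m1) == (1, 0):
        return lambda x, y: x ^ y   # XOR
    return lambda x, y: x & y       # AND


def mtx_mimd(entry, as_, bs, cs, bit_count):
    """Run the sequence on one entry (a mutable list of bits)."""
    p1 = cs                          # line 0
    c = entry[p1]                    # line 1: load C, then increment p1
    p1 += 1
    xl = entry[p1]                   # line 2: load XL
    op = decode(m0=xl, m1=c)         # XL holds M0, C holds M1
    p2, p3 = as_, bs                 # line 3
    for _ in range(bit_count):       # line 4
        x = entry[p2]                # line 5: load X, then increment p2
        p2 += 1
        entry[p3] = op(x, entry[p3]) # line 6: operate and store back
        p3 += 1
    return entry


# Hypothetical layout: bits 0-3 operand a, bits 4-7 operand b, bits 8-9 = C, XL.
entries = [
    [1, 1, 0, 0, 1, 0, 1, 0, 1, 1],  # C=1, XL=1: AND
    [1, 1, 0, 0, 1, 0, 1, 0, 0, 1],  # C=0, XL=1: XOR
    [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],  # C=1, XL=0: OR
    [1, 1, 0, 0, 1, 0, 1, 0, 0, 0],  # C=0, XL=0: NOT
]
for e in entries:                    # the device would process all entries in parallel
    mtx_mimd(e, as_=0, bs=4, cs=8, bit_count=4)
results = [e[4:8] for e in entries]
print(results)
```

The same instruction stream yields a different logic operation in each entry, because only the two control bits stored in that entry differ.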

As described above, in accordance with Embodiment 3 of the present invention, as the registers for storing the MIMD instruction, registers for storing operational data of the ALU processing element are used, so that a dedicated MIMD operation instruction register becomes unnecessary. Thus, the area occupied by the ALU processing element can be reduced.

Embodiment 4

FIG. 51 schematically shows a configuration of the ALU processing element 34 according to Embodiment 4 of the present invention. The configuration of the ALU processing element shown in FIG. 51 differs from that of the ALU processing element shown in FIG. 50 in the following point. Specifically, the movement control data bits E0-E3 for the inter-ALU communication circuit (RECM) 71 are stored in C register 53, F register 54, XL register 58 and D register 59. Further, MIMD instruction bits M0 and M1 are stored in XL register 58 and C register 53.

As an example, in XL register 58, instruction bit M0 and control bit E0 are stored, and in C register 53, MIMD operation instruction bit M1 and data movement amount control bit E3 are stored. In F register 54 and D register 59, movement amount control bits E1 and E2 are stored, respectively. The data movement operation and the MIMD operational instructions are not simultaneously executed. Therefore, collision of data bits does not occur even when C register 53 and XL register 58 are used for storing the MIMD operational instruction and the control bits of the zigzag copying operation.

The configuration of the ALU processing element in accordance with Embodiment 4 shown in FIG. 51 is equivalent to the combination of configurations shown in FIGS. 49 and 50. Here, it is unnecessary to provide dedicated registers for setting the movement data amount for each entry and for storing the MIMD instruction for each entry, so that the occupation area of the ALU processing element can further be reduced.

As described above, according to Embodiment 4 of the present invention, as the registers for storing the MIMD instruction and respective control bits of the RECM data, registers provided in the ALU processing element are utilized. Therefore, it is unnecessary to add new, further registers in the ALU processing element, and the increase in area of the ALU processing element can be avoided. By way of example, when 1024 entries are provided and 6 registers (2 bits for the MIMD register, 4 bits for the RECM registers) are shared per one ALU processing element, a total of 6144 registers (1024 × 6) can be eliminated, and the area increase can effectively be prevented.

The manner of loading data and movement/instruction control data to each register and the manner of executing the MIMD instruction are the same as those described in Embodiment 1 above. By issuing the zigzag copy instruction and the MIMD operational instruction once, data transfer and operation can be executed on an entry-by-entry basis.

Embodiment 5

FIG. 52 shows an example of a specific configuration of MIMD instruction decoder 74 described in Embodiments 1 to 4. In the configuration shown in FIG. 52, MIMD instruction bits M0 and M1 are generated by XL register 58 and C register 53, respectively. The MIMD instruction bits, however, may be stored in a dedicated MIMD register, as in Embodiment 1.

Referring to FIG. 52, MIMD instruction decoder 74 includes inverters 161 and 162 receiving instruction bits M0 and M1, respectively, an AND circuit 163 receiving output signals of inverters 161 and 162 and generating a negation operation designating signal φnot, an AND circuit 164 receiving an output signal of inverter 161 and instruction bit M1 and generating a logical sum operation designating signal φor, an AND circuit 165 receiving instruction bit M0 and an output signal of inverter 162 and generating an exclusive logical sum operation designating signal φxor, and an AND circuit 166 receiving instruction bits M0 and M1 and generating a logical product operation designating signal φand. One operation designating signal φmimd is activated in accordance with the logical values of instruction bits M0 and M1, and an internal connection for executing the corresponding logic operation is set in adder 50.

The MIMD instruction decoder 74 shown in FIG. 52 is implemented by a combinational circuit using inverters and AND circuits (NAND gates and inverters). By implementing the MIMD instruction decoder 74 by a combinational circuit, the area occupied by instruction decoder 74 can be reduced and, in addition, high-speed decoding operation becomes possible.

The configuration of the combinational circuit for MIMD instruction decoder 74 shown in FIG. 52 is only an example, and other combinations of logic gates may be used.
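The gate-level decoding of FIG. 52 can be checked with a minimal Python sketch. The function name and the dictionary keys are hypothetical. The XOR term is taken here as M0 AND (NOT M1), which is the only assignment under which exactly one of the four designating signals is active for every combination of M0 and M1; this reading is an assumption of the sketch.

```python
def mimd_decode(m0, m1):
    """Return the four operation designating signals for bits M0 and M1."""
    n0 = 1 - m0  # inverter receiving M0
    n1 = 1 - m1  # inverter receiving M1
    return {
        "not": n0 & n1,  # active when M0=0, M1=0
        "or": n0 & m1,   # active when M0=0, M1=1
        "xor": m0 & n1,  # active when M0=1, M1=0 (assumed reading)
        "and": m0 & m1,  # active when M0=1, M1=1
    }


# Print the truth table: exactly one signal is 1 in each row.
for m1 in (0, 1):
    for m0 in (0, 1):
        print(m0, m1, mimd_decode(m0, m1))
```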

Embodiment 6

FIG. 53 schematically shows a configuration of MIMD instruction decoder 74 according to Embodiment 6 of the present invention. In the configuration of FIG. 53 also, the MIMD instruction is represented by bits M1 and M0 from C register 53 and XL register 58. The MIMD instruction bits, however, may be applied from a dedicated MIMD register.

In FIG. 53, MIMD instruction decoder 74 is formed by a multiplexer (MUX) 170 that selects any of MIMD operation instructions alu.op.not, alu.op.or, alu.op.xor, and alu.op.and in accordance with instruction bits M0 and M1 and applies the selected one to adder 50.

The MIMD operation instructions applied to multiplexer 170 are each bit-deployed and supplied in the form of a code. In accordance with control bits M0 and M1, a code representing the designated operation instruction is selected and applied to adder 50.

FIG. 54 schematically shows a specific configuration of multiplexer 170 shown in FIG. 53. Referring to FIG. 54, multiplexer 170 includes selectors SEL1-SELn, each performing a 4-to-1 selection in accordance with the MIMD instruction bits M0 and M1.

In order to generate a bit pattern of the MIMD operational instruction, an instruction pattern memory ROM is provided. The instruction pattern memory ROM is a read-only memory, and includes memory regions MM1-MMn each having a bit width of 4 bits, provided corresponding to selectors SEL1 to SELn. At bit positions of the same number in memory regions MM1 to MMn, code bits of the same MIMD operational instruction are stored. Therefore, by selectors SEL1 to SELn, values stored at the same bit positions of these memory regions MM1 to MMn are selected in accordance with the operation instruction bits M0 and M1, and a control pattern of n-bit width representing the operational instruction as a bit pattern (code) is selected and applied to adder 50. The bit width n of the bit pattern is set in accordance with the internal configuration of adder 50, and is the number of bits necessary for switching the signal propagation path for realizing the designated logic operation in adder 50.
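The arrangement of memory regions and selectors can be sketched in Python as follows. The code width n = 3, the code values, and the ordering of the four bit positions (index 2·M1 + M0) are hypothetical placeholders; the actual patterns depend on the internal configuration of adder 50, which is not modeled here.

```python
N = 3  # hypothetical code width n

# Hypothetical n-bit codes, one per instruction, indexed by 2*M1 + M0.
CODES = [
    (0, 0, 1),  # index 0: M1=0, M0=0, NOT
    (1, 1, 0),  # index 1: M1=0, M0=1, XOR
    (0, 1, 0),  # index 2: M1=1, M0=0, OR
    (1, 0, 0),  # index 3: M1=1, M0=1, AND
]

# Memory regions MM1..MMn: region k holds bit k of each of the four codes,
# so every region is 4 bits wide.
ROM = [[CODES[j][k] for j in range(4)] for k in range(N)]


def select_code(m0, m1):
    """Each selector picks the same bit position from its own region."""
    idx = 2 * m1 + m0
    return tuple(region[idx] for region in ROM)


print(ROM)
print(select_code(1, 1))
```

Adding an instruction type would widen each region and selector, while widening the code would add regions; both changes leave the selection logic itself unchanged.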

The instruction pattern memory ROM is provided common to the ALU processing elements of all the entries of the main processing circuitry. The stored values of the instruction pattern memory ROM are set by masking during manufacturing. Therefore, by changing the mask values when masking the instruction pattern memory ROM, the instructions to be executed as MIMD operations can easily be changed, and hence, the contents of operation to be executed can easily be changed. Further, by extending the bit width of selectors SEL1 to SELn and of memory regions MM1 to MMn, extension in types of MIMD operational instructions can easily be accommodated.

The instruction pattern memory ROM need not be a mask ROM, and it may be formed by an electrically rewritable (erasable and programmable) non-volatile memory. In that case also, the logic operation instructions can be changed or extended easily, by electrically rewriting the stored contents.

Embodiment 7

FIG. 55 schematically shows a configuration of MIMD instruction decoder 74 in accordance with Embodiment 7 of the present invention. Referring to FIG. 55, MIMD instruction decoder 74 includes a memory 175 that stores the MIMD operational instructions deployed in bit patterns. Memory 175 has 4 addresses corresponding to the MIMD operation instructions (1 address has n-bit width). Memory 175 reads an operational instruction pattern (instruction code) of a designated address, using operation instruction bits M0 and M1 as an address.

Memory 175 may be any memory that allows random accessing, and a common SRAM (Static Random Access Memory) or a flash memory may be used. Though not explicitly shown in FIG. 55, memory 175 naturally has an address decoder for decoding instruction bits M0 and M1 and an input/output circuit for writing/reading the bit pattern (instruction code). Memory 175 may be formed by a register file.

When an instruction set of memory 175 is changed, a code (bit pattern) of each instruction set is stored in each corresponding data entry. Here, a configuration may be adopted in which a register for serial/parallel conversion is provided in a preceding stage of the input circuit in memory 175, the instruction code is transferred to memory 175 from the corresponding data entry by 1-bit unit, and for the n-bit instruction code of each instruction set, the instruction code transferred in bit-serial manner is written to the corresponding address position in n-bit parallel manner.
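The bit-serial transfer followed by an n-bit parallel write can be sketched in Python. The class and method names, the address ordering (2·M1 + M0), and the example code value are hypothetical; the sketch only illustrates the serial/parallel conversion register placed ahead of a four-address memory and is not a description of the actual circuit.

```python
class InstructionMemory:
    """Four-address memory of n-bit codes, read with (M0, M1) as the address."""

    def __init__(self, n):
        self.n = n
        self.words = [[0] * n for _ in range(4)]
        self._shift = []  # serial/parallel conversion register

    def shift_in(self, bit, address):
        """Accept one bit; once n bits have arrived, write them in parallel."""
        self._shift.append(bit)
        if len(self._shift) == self.n:
            self.words[address] = self._shift
            self._shift = []

    def read(self, m0, m1):
        """Read the code selected by the instruction bits."""
        return tuple(self.words[2 * m1 + m0])


mem = InstructionMemory(n=3)
for bit in (1, 0, 1):          # hypothetical 3-bit code sent one bit at a time
    mem.shift_in(bit, address=3)
print(mem.read(m0=1, m1=1))
```

Because the word is committed only after all n bits have arrived, a read issued in the middle of a transfer still returns the previous code for that address.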

Alternatively, a configuration may be adopted in which a bus dedicated to MIMD instruction transfer is provided for the MIMD instruction decoder, and through such dedicated bus, each instruction code of the instruction set is transferred through internal bus 14 shown in FIG. 1 and written to memory 175 under control of controller 21. Further, a configuration may be used in which controller 21 generates the MIMD instruction codes in main processing circuitry 20, and writes the codes to memory 175 for the instruction decoder of each entry.

Further, memory 175 may be formed into a 2-port configuration having A port of 1/2 bit width (one or two bit width) and B port of n-bit width, and the instruction code may be written through A port at the time of writing to memory 175, and the instruction code may be read through B port at the time of reading.

Implementation of MIMD instruction decoder 74 by a memory (RAM) 175 such as shown in FIG. 55 provides the following effects. Specifically, by rewriting an instruction code held in memory 175, the instruction set of usable MIMD instructions can be changed even when the parallel processing device (MTX) is in operation.

The present invention allows, when applied to a processing device (MTX) having an SIMD type architecture executing parallel operations, execution of operations with low parallelism at high speed. Application of the invention is not limited to the parallel processing, and it may also be applied to an emulator device for a logic circuit.

Although the present invention has been described and illustrated in detail, it is clearly understood that the same is by way of illustration and example only and is not to be taken by way of limitation, the scope of the present invention being interpreted by the terms of the appended claims.

1-12. (canceled)
 13. A parallel processing device, comprising plural operational blocks each comprising: a data storage unit including a plurality of data entries each including a plurality of memory cells arranged as a memory cell array and arranged corresponding to a respective entry; a plurality of arithmetic/logic processing elements, each of which couples with a corresponding entry and performs a designated operational processing on data stored in the corresponding entry; and a plurality of data communication circuits, provided corresponding to the entries, each performing data communication between a corresponding entry and another entry, the plurality of data communication circuits each having entry-to-entry distance and direction of data movement set individually in accordance with operation kind to be performed.
 14. The parallel processing device according to claim 13, wherein each of the data communication circuits includes a movement data register for storing data for setting an amount of data movement, and a multiplexer for setting a data transfer path in accordance with the data stored in the movement data register.
 15. The parallel processing device according to claim 13, wherein each of the arithmetic/logic processing elements includes a plurality of registers for storing data to be processed and mask data for masking the operational processing; and the movement data register is formed by a register, among the plurality of registers, other than the registers used in the data movement operation.
 16. The parallel processing device according to claim 13, wherein movement data for setting an amount of the data movement designates one data movement amount among a plurality of predetermined amounts of data movement including direction of data transfer, the entries are arranged successively from an uppermost to a lowermost order, and when destination of movement exceeds the uppermost or lowermost entry in data movement the destination of data movement is designated in a cyclic manner.
 17. A parallel processing device, comprising plural operational blocks each comprising: a data storage unit having a plurality of data entries, each data entry including a plurality of memory cells arranged as a memory cell array and arranged corresponding to a respective entry; a plurality of arithmetic/logic processing elements each of which couples with a corresponding entry and performs a designated operational processing on data stored in the corresponding entry; and a plurality of data communication circuits, provided corresponding to the entries, each performing data communication between a corresponding entry and another entry, the plurality of data communication circuits each having entry-to-entry distance and direction of data movement set individually in accordance with operation kind to be performed; data for setting contents of operational processing of each respective arithmetic/logic processing element and data for designating amount of data movement of the data communication circuit being set in an empty register among a plurality of registers storing data to be processed and mask data for masking an operational processing, provided in the respective arithmetic/logic element.